In March 2019, Neue Zürcher Zeitung (NZZ) launched a new text-to-speech service to the public, an improved version of the beta audio player it released the previous October. The company shared some of the key lessons it learned along the way.
- Google Wavenet is not enough for the Swiss German language
NZZ used Google Wavenet to generate its audio files. While the technology handles languages well (it currently speaks nine with natural-sounding quality), it was not robust enough for the complexities of Swiss German. To solve the problem, NZZ added a middleware layer with a lexicon that the text passes through before it is converted into audio.
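As a rough illustration of the lexicon idea, the sketch below rewrites known problem words before the text would be handed to a speech engine. It is a minimal sketch and not NZZ's actual middleware: the `apply_lexicon` function, the `LEXICON` table and its respellings are all hypothetical, invented for this example.

```python
import re

# Hypothetical lexicon: maps a written form to a spelling that a
# text-to-speech engine is more likely to pronounce as intended.
# These entries are invented for illustration only.
LEXICON = {
    "NZZ": "En Zett Zett",
    "SBB": "Es Be Be",
}


def apply_lexicon(text, lexicon):
    """Replace whole-word lexicon matches in text before synthesis."""
    if not lexicon:
        return text
    # Try longer entries first so that an entry which is a prefix of
    # another cannot shadow it.
    keys = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b"
    )
    return pattern.sub(lambda match: lexicon[match.group(1)], text)


if __name__ == "__main__":
    print(apply_lexicon("Die NZZ zitiert die SBB.", LEXICON))
```

Matching on word boundaries keeps the replacement from firing inside longer words, which matters once a lexicon grows beyond a handful of entries.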
- Architecture must be mix-and-match friendly
In an industry where tools, needs and products keep changing, NZZ needed a service that could be adapted easily. By building a mix-and-match architecture, the company was able to move the service from Amazon Polly to Google Wavenet at short notice, which improved the service dramatically.
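One common way to get this kind of swappability is to hide each speech engine behind a shared interface and select it by name. The sketch below assumes that approach; it is not a description of NZZ's system. The class names, the `ENGINES` registry and the fake engines are hypothetical stand-ins, and no real Amazon Polly or Google Wavenet API is called.

```python
from typing import Callable, Dict, Protocol


class SpeechEngine(Protocol):
    """Anything that can turn text into audio bytes."""

    def synthesize(self, text: str) -> bytes:
        ...


class FakePollyEngine:
    """Stand-in for a Polly-backed engine; returns tagged bytes, not audio."""

    def synthesize(self, text: str) -> bytes:
        return b"polly:" + text.encode("utf-8")


class FakeWavenetEngine:
    """Stand-in for a Wavenet-backed engine; returns tagged bytes, not audio."""

    def synthesize(self, text: str) -> bytes:
        return b"wavenet:" + text.encode("utf-8")


# Hypothetical registry: switching vendors means changing one name.
ENGINES: Dict[str, Callable[[], SpeechEngine]] = {
    "polly": FakePollyEngine,
    "wavenet": FakeWavenetEngine,
}


def build_engine(name: str) -> SpeechEngine:
    """Look up an engine by name and create an instance of it."""
    try:
        return ENGINES[name]()
    except KeyError:
        raise ValueError(f"unknown speech engine: {name!r}") from None


def article_to_audio(text: str, engine: SpeechEngine) -> bytes:
    """The rest of the service depends only on the SpeechEngine interface."""
    return engine.synthesize(text)


if __name__ == "__main__":
    print(article_to_audio("Guten Tag", build_engine("wavenet")))
```

Because callers only see `SpeechEngine`, replacing one vendor with another touches the registry and the new adapter class, not the code that produces articles.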
- Audio isn’t for everyone
Some people love audio, and others simply do not. NZZ asked both groups to evaluate different text-to-speech engines alongside text read by an actual human. Both groups (unsurprisingly) rated the natural human voice highest. That result mattered less than the underlying split, however: users either liked audio or did not, and whether the voice was human or synthetic did not seem to have an important influence on usage.
- Techniques for making a written piece into good audio
Audio content must be pleasant to listen to, so NZZ had to lay out its articles differently before converting them into good-quality MP3 files. This involved looking closely at how users would like to have an article read to them, and the findings informed NZZ's audio templates.
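To make the idea of an audio template concrete, the sketch below turns a structured article into a script meant for listening. It is only an assumed example: the `Article` fields, the spoken byline and the decision to skip image captions are hypothetical choices, not NZZ's actual template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Article:
    """Hypothetical article structure used only for this example."""

    title: str
    author: str
    paragraphs: List[str]
    # Visual-only elements; the template below leaves them out.
    captions: List[str] = field(default_factory=list)


def render_audio_script(article: Article) -> str:
    """Build the text that would be sent to the speech engine."""
    parts = [f"{article.title}.", f"By {article.author}."]
    # Keep only non-empty body paragraphs, in reading order.
    parts.extend(p.strip() for p in article.paragraphs if p.strip())
    # Blank lines between parts give the engine natural pause points.
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = Article(
        title="Example headline",
        author="Example Author",
        paragraphs=["First paragraph.", "  ", "Second paragraph."],
        captions=["A photo caption that makes no sense without the image."],
    )
    print(render_audio_script(demo))
```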
- Different player experiences present a challenge
NZZ wanted to introduce its audio across all products and on all platforms (desktop, tablets and apps). This meant designing and developing many different player variants to accommodate the technical considerations of each player.
The Bottom Line:
Many variables go into creating a good text-to-speech audio service, and they must be fully considered during building, testing and beta in order to launch a quality audio product successfully.