
Week 3: Text-to-Speech as Synthetic Media

Our in-class discussion about deepfakes and other synthetic media emphasized the discomfort we felt at the prospect of such technology becoming commonplace. But I realized there’s at least one kind of AI-generated media that has already settled comfortably into our everyday lives: text-to-speech.

Perhaps it’s because we’ve long grown used to robots reading things to us – from subway announcements to caller ID – that the leap to virtual assistants, Vocaloid, and other AI voices didn’t seem so grand. Or maybe it was just successful marketing by the likes of Apple. In any case, it’s now unremarkable to carry around a device that will speak whatever words we give it, and more.

I have been learning and speaking Japanese for the last six years, and I use Google’s text-to-speech accessibility settings to read text that I have trouble understanding. Japanese uses kanji: characters that can’t be sounded out like an alphabet and that may have different pronunciations depending on when and how they’re used. A common problem for me is grasping the gist of a sentence from the meaning of its kanji without knowing how to actually read it aloud. Text-to-speech is a quick and easy way to learn the pronunciation.
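To make the ambiguity concrete, here is a toy sketch of the problem. The tiny dictionary below is hypothetical hand-picked sample data, not a real lexicon, and picking the right reading in context is exactly the disambiguation work a TTS system has to do for you:

```python
# Toy illustration: one written form in kanji can map to several readings.
# READINGS is hypothetical sample data, not a real Japanese lexicon.
READINGS = {
    "今日": ["きょう", "こんにち"],   # "today" vs. "these days" (as in 今日は)
    "一人": ["ひとり", "いちにん"],   # "alone/one person" vs. a counter reading
}

def possible_readings(word: str) -> list[str]:
    """Return every known reading for a written form, or [] if unknown."""
    return READINGS.get(word, [])

for word in ["今日", "一人", "未知"]:
    print(word, "->", possible_readings(word))
```

Seeing the kanji tells you roughly what a sentence means; it’s choosing among these readings (or handling a word the dictionary doesn’t cover at all) that text-to-speech shortcuts for me.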

Google’s Speech Services is powered by WaveNet, a deep neural network created by the Google subsidiary DeepMind. According to DeepMind, WaveNet voices sound more natural than other text-to-speech (TTS) systems because the model learns to generate raw audio waveforms directly, one sample at a time, rather than stitching together fragments of recorded speech the way competing systems do [link to WaveNet]. It’s unclear how much data is used to train WaveNet today, but you can buy time on the model through Google Cloud [link to Google Cloud]. An earlier version, first described in a 2016 paper, was trained on 24.6 hours of North American English and 34.8 hours of Mandarin Chinese read by professional speakers. I can only imagine how many hours of data are being used now.

WaveNet: A generative model for raw audio
This post presents WaveNet, a deep generative model of raw audio waveforms. We show that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%.
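The core idea behind "generating raw audio" can be sketched in a few lines: each new waveform sample is predicted from the samples that came before it, and the output is fed back in as input. The "model" below is a hypothetical stand-in (a weighted sum over a short context window), not the dilated-convolution network from the actual paper, which conditions on thousands of past samples:

```python
# Minimal sketch of autoregressive waveform generation, WaveNet-style.
# toy_next_sample is a hypothetical stand-in for the real neural network.

def toy_next_sample(history: list[float]) -> float:
    """Predict the next sample from recent context (toy weighted sum)."""
    context = history[-4:]  # the real model sees a far longer context
    weights = [0.1, 0.2, 0.3, 0.4][-len(context):]
    return sum(w * s for w, s in zip(weights, context))

def generate(n_samples: int, seed: float = 1.0) -> list[float]:
    """Roll out a waveform one sample at a time, feeding outputs back in."""
    samples = [seed]
    for _ in range(n_samples - 1):
        samples.append(toy_next_sample(samples))
    return samples

wave = generate(8)
print(wave)
```

The contrast with concatenative TTS is the point: instead of looking up and splicing pre-recorded syllables, the model produces the audio signal itself, which is what lets it learn the rhythm and intonation of a voice rather than just its sounds.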

The TTS page on Google Cloud promises voices of “near human quality”, as well as the ability to “create a unique voice” for a brand. Searching “realistic synthetic voices” on YouTube reveals a handful of companies attempting the same. But nothing sounds so convincing that you would mistake it for a human, which I think has been part of the appeal of Vocaloid and other voice-synthesis software in music – just as a machine couldn’t approach a human voice, neither could a human imitate a machine.

But voice synthesis doesn’t need to sound human before it can do harm. Like any machine learning system, it relies on a vast collection of data, which might be sourced from our interactions with everyday apps and devices or scraped from existing media. That raises concerns not only about attribution and privacy, but also about the biases introduced as the data is labeled and stored. And when this data is stockpiled and traded between businesses – without any input from us – the emergence of disturbing use cases that perpetuate and profit off a harmful status quo is a matter of course:

The AI startup erasing call center worker accents: is it fighting bias – or perpetuating it?
A Silicon Valley startup offers voice-altering tech to call center workers around the world: ‘Yes, this is wrong … but a lot of things exist in the world’