Feeding the Tacotron — embedding prosody in the prosaic.
Google Research has just published some impressive results in the field of speech synthesis using neural networks to deliver a truly human-like voice. Check out the sound clips: https://research.googleblog.com/2018/03/expressive-speech-synthesis-with.html?m=1
And it reminded me of the crude speech synthesizer I built exactly 30 years ago, below.
But this is getting really good, making voice shifting and FakeSpeech almost a turnkey task. What to do? Might we need to embed high-frequency private key signatures in our voice recordings for authentication? My son built a pretty good autoencoder to swap his face onto other people in YouTube videos. If kids can do it for fun, imagine the professionals…
More speech demos: https://google.github.io/tacotron/publications/global_style_tokens/
And the full paper: https://arxiv.org/pdf/1803.09017.pdf
“To deliver true human-like speech, a Text-To-Speech system must learn to model prosody. Prosody is the confluence of a number of phenomena in speech, such as paralinguistic information, intonation, stress, and style. In this work we focus on style modeling, the goal of which is to provide models the capability to choose a speaking style appropriate for the given context. While difficult to define precisely, style contains rich information, such as intention and emotion, and influences the speaker’s choice of intonation and flow.”
And where did they get the training data sets? “Audio tracks mined from 439 official TED YouTube channel videos.”
