Written by Debasish Mitra, VP Engineering, Mihup
Text-to-speech has been a key area of focus for software developers in recent years, especially those working on AI for smart machines, deep learning, and NLP. It is thanks to the efforts of researchers in this arena that we are seeing some really impressive results. Usually, a deep neural network is trained on a single speaker, using several hours of voice data recorded by professionals. The challenge comes when you have to give a new voice to a model built this way: the change requires a new dataset of voice recordings, and the model has to be retrained, making it a costly affair.
Recently, researchers developed a three-stage pipeline that can clone a new voice from just a few seconds of reference audio, without the conventional need to retrain the model, and the results they shared sound extremely natural. The plan now is to replicate the success of the research model and open source it to the public. By using a new vocoder model, the developers aim to make the framework capable of operating in real time. Essentially, the goal is to develop a three-stage deep learning model capable of real-time voice cloning.
The model used in the research can capture a realistic imitation of a voice from a sample of only five seconds, and, given a text prompt, it can use any captured voice to perform text-to-speech. The next stage in the development process is to deploy suitable deep learning models and the right pipelines for pre-processing the data. This will be followed by training the models on several thousand speakers, for durations stretching from weeks to months, on datasets comprising tens of thousands of hours of speech. The core aim is to make the system operate in real time, that is, to capture a voice and generate the speech imitation in less time than it takes to produce the original speech. The framework developed through this process would be able to clone voices it has never heard or been trained on, and to generate speech from text it has never seen before.
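As a rough illustration of what "real-time" means here, the sketch below measures a real-time factor: generation time divided by the duration of the audio produced, where a value below 1.0 means the system keeps up with real time. The `generate_fn` callable and the 22,050 Hz sample rate are assumptions for illustration, not details from the research.

```python
import time

import numpy as np


def real_time_factor(generate_fn, text: str, sample_rate: int = 22050) -> float:
    """Return generation time divided by the duration of the generated audio.

    `generate_fn` is a hypothetical callable mapping text to a 1-D array of
    audio samples; a result below 1.0 means synthesis is faster than real time.
    """
    start = time.perf_counter()
    waveform = np.asarray(generate_fn(text))
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate
    return elapsed / audio_seconds
```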
A multi-speaker text-to-speech synthesis system can clone the voices of different users. However, collecting voice data and training the system for each user is a cumbersome, difficult process, and that is the primary challenge with conventional text-to-speech methods.
Speaker Verification to Text-to-Speech Synthesis
The new approach, known as SV2TTS, uses three independent components to build an efficient solution to the challenge of multi-speaker adaptation during speech synthesis. These three components are deep learning models that are trained separately from one another.
1. Speaker Encoder
The voice data from each speaker is encoded into an embedding generated by a neural network trained with a speaker verification loss. The speaker verification loss is computed by predicting whether or not two voice samples come from the same speaker.
Speaker Embeddings
Embeddings that belong to the same speaker will be highly similar. During this training, there is no need to know the text that is going to be vocalized. The embeddings are agnostic to the downstream task, which allows the encoder to be trained independently from the synthesis models that follow.
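As a minimal sketch of how such embeddings are compared, the snippet below scores two fixed-size embedding vectors with cosine similarity and applies a decision threshold; the 0.75 value is purely illustrative and not taken from the research.

```python
import numpy as np


def cosine_similarity(embed_a: np.ndarray, embed_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (1.0 = identical direction)."""
    a = embed_a / np.linalg.norm(embed_a)
    b = embed_b / np.linalg.norm(embed_b)
    return float(np.dot(a, b))


def same_speaker(embed_a: np.ndarray, embed_b: np.ndarray, threshold: float = 0.75) -> bool:
    """Verification decision: embeddings from the same speaker should score above the threshold."""
    return cosine_similarity(embed_a, embed_b) >= threshold
```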
2. Synthesizer
The synthesizer is the core component of text-to-speech synthesis. It takes a sequence of phonemes as input and generates a spectrogram of the corresponding text. Phonemes are the smallest units of sound in a word; each word is broken down into its phonemes to create the input sequence for the model. To support multiple speakers' voices, the model also takes the speaker embedding as input.
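The sketch below shows one common way of conditioning a synthesizer on a speaker, assuming PyTorch: the speaker embedding is broadcast across every timestep of the encoded phoneme sequence and concatenated with it before mel frames are predicted. The layer sizes, the single GRU encoder, and the linear projection standing in for a full attention-based decoder are all illustrative simplifications, not the architecture used in the research.

```python
import torch
import torch.nn as nn


class ToyMultiSpeakerSynthesizer(nn.Module):
    """Toy sketch: phoneme sequence + speaker embedding -> mel spectrogram frames."""

    def __init__(self, n_phonemes=70, phoneme_dim=256, speaker_dim=256, n_mels=80):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(n_phonemes, phoneme_dim)
        self.encoder = nn.GRU(phoneme_dim, phoneme_dim, batch_first=True)
        # A real synthesizer uses attention and an autoregressive decoder;
        # a single projection stands in for that here.
        self.to_mel = nn.Linear(phoneme_dim + speaker_dim, n_mels)

    def forward(self, phoneme_ids, speaker_embedding):
        x = self.phoneme_embedding(phoneme_ids)            # (B, T, phoneme_dim)
        x, _ = self.encoder(x)                             # (B, T, phoneme_dim)
        # Broadcast the speaker embedding to every timestep and concatenate it,
        # so every predicted frame is conditioned on the target voice.
        spk = speaker_embedding.unsqueeze(1).expand(-1, x.size(1), -1)
        return self.to_mel(torch.cat([x, spk], dim=-1))    # (B, T, n_mels)


if __name__ == "__main__":
    phonemes = torch.randint(0, 70, (1, 12))   # a batch with 12 phoneme ids
    speaker = torch.randn(1, 256)              # embedding from the speaker encoder
    mel = ToyMultiSpeakerSynthesizer()(phonemes, speaker)
    print(mel.shape)                           # torch.Size([1, 12, 80])
```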
However, rebuilding audio from a spectrogram is not as easy as creating the spectrogram from an audio sample. This is where the following vocoder network is needed to generate the audio.
3. Vocoder
A sample-by-sample autoregressive WaveNet model is used to generate the voice. It takes a mel spectrogram as input and produces the corresponding time-domain waveform.
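The loop below is a toy sketch of why naive sample-by-sample autoregression is slow: each output sample is fed back as input for the next one. The `predict_next` callable and the hop size of 275 samples per mel frame are illustrative placeholders for a trained WaveNet-style network and its actual framing.

```python
import numpy as np


def autoregressive_vocode(predict_next, mel_frames, samples_per_frame: int = 275) -> np.ndarray:
    """Generate a waveform one sample at a time, conditioned on mel frames.

    `predict_next(history, frame)` stands in for a trained autoregressive network
    that returns the next audio sample given all previous samples and the
    current conditioning frame.
    """
    audio = [0.0]  # seed sample
    for frame in mel_frames:                 # one conditioning frame at a time
        for _ in range(samples_per_frame):   # many audio samples per frame
            audio.append(predict_next(audio, frame))
    return np.asarray(audio[1:], dtype=np.float32)
```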
Real-Time Voice Cloning Application
As described by its authors, the Real-Time Voice Cloning application (SV2TTS) is a three-stage deep learning framework that generates a numerical representation of a voice from only a few seconds of audio and uses it to condition a text-to-speech model trained to generalize to new voices.
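Putting the three stages together, inference looks roughly like the sketch below. The object and method names (`embed_utterance`, `synthesize`, `infer_waveform`) are illustrative placeholders for the trained models described above, not the API of any specific implementation.

```python
import numpy as np


def clone_and_speak(reference_wav: np.ndarray, text: str,
                    speaker_encoder, synthesizer, vocoder) -> np.ndarray:
    """Three-stage SV2TTS inference: encode a voice, synthesize, then vocode."""
    # 1. Speaker encoder: a few seconds of reference audio -> fixed-size embedding
    embedding = speaker_encoder.embed_utterance(reference_wav)
    # 2. Synthesizer: text (as a phoneme sequence) + embedding -> mel spectrogram
    mel = synthesizer.synthesize(text, embedding)
    # 3. Vocoder: mel spectrogram -> time-domain waveform in the cloned voice
    return vocoder.infer_waveform(mel)
```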
Solving the text-to-speech synthesis challenge through real-time voice cloning can offer a number of benefits, such as reading PDFs aloud, helping visually impaired people interact with text more easily, and making chatbots more interactive and responsive. The new development promises all of that, and might just be the transformation that TTS researchers have been searching for!