We’re entering a world where many foundational machine learning models are trained on vast amounts of data. To get a sense of how much, take Llama 3, which was trained on 15 trillion tokens, roughly equivalent to 54.5 billion A4 pages. This is the scale of data required to give models emergent capabilities. Large language models (LLMs) are trained only to predict the next word, but in that process they learn humour, chemistry, math and the nuances of different cultures. If we treat an LLM as a black box, the input is some text (your question to ChatGPT) and the output is some other text (ChatGPT’s answer).
Transferring that concept to text-to-speech (TTS) systems, we gain an optional extra input and an extra output: the source audio and the output audio, respectively. In the context of dubbing, for example, the source audio is the audio in the original language. Just as in the LLM world, TTS models in the audio world are getting bigger so that they can tackle more nuanced tasks (whether that’s dubbing, sound effects or music generation) and capture more nuance (emotions, hesitations, prosody) in the source audio, that is, the content our customers want to dub into another language using AI dubbing.
The previous generation of models had multiple components: some aided by machine learning and some that were rule-based.
While these system architectures served their purpose and allowed us to produce natural-sounding speech, they struggle to produce an extended expressive range (like laughter and disfluencies), or the generated speech ends up sounding robotic. Here are some examples of such speech:
Tortoise TTS, architectural notes from the author (presumably with an iPad in bed)
In the last couple of years, there has been a large speech modelling boom. Tortoise TTS, implemented by a single person in his bedroom with GPUs sponsored by NVIDIA, is considered seminal in this domain. It showed that powerful models could be built from simple but scalable architectures and large amounts of clean data. Major research labs like Microsoft Research also designed and trained powerful models around the same time, such as VALL-E. These next-generation TTS models unlocked a new range of emotion and expressivity in speech.
Listen to an example from Tortoise V2 below. More can be found here.
This new generation of GPT-powered large models is beautiful (to the trained eye). They scale elegantly, but they’re also extremely data-hungry. The consensus in the speech community is that a typical from-scratch training run requires tens of thousands to roughly 100,000 hours of speech data.
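To make “data-hungry” concrete, here is a rough back-of-envelope sketch in Python. The audio format assumptions (16-bit mono PCM at 44.1 kHz) are ours, purely for illustration:

```python
# Back-of-envelope: how much raw audio is ~100,000 hours?
# Assumptions (ours, for illustration): 16-bit mono PCM at 44.1 kHz.

HOURS = 100_000
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2  # 16-bit mono

bytes_per_hour = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * 3600
total_bytes = bytes_per_hour * HOURS

print(f"{bytes_per_hour / 1e9:.2f} GB per hour of raw audio")  # ~0.32 GB
print(f"{total_bytes / 1e12:.1f} TB for {HOURS:,} hours")      # ~31.8 TB
print(f"{HOURS / 24 / 365:.1f} years of continuous audio")     # ~11.4 years
```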
It’s not just the size of the data that matters; it’s the quality. We've seen instances in which smaller models trained on the right quality of data perform as well as much bigger models. On the flip side, we’ve also seen that pre-trained models can be fine-tuned (see OpenAI’s FAQ section on fine-tuning) in a few iterations with small amounts of high-quality data.
This brings up an interesting challenge in speech: “unclean” data is often the most valuable. The following examples would typically be regarded as “unclean” when, in fact, they give our models a truer picture of the “messiness” of human communication. To produce realistic synthetic speech, we need our models to learn the full breadth of that messiness, including:
These scenarios are extremely valuable for a model to capture, and perhaps even to reproduce. How cool would it be for a model to switch between three languages, like, when I talk to my parents?
To produce the best results, we work with large amounts of high-quality data. In the speech world, high-quality data means highly curated studio recordings made by voice actors. However, this tier of data does have downsides.
In order to tackle both these problems, we look to the other two tiers of data:
All these tiers of data have their place in training a single model.
Therefore, before embarking on a large model training run, we plan how much data we need in each of the above buckets.
Large data comes with large data problems. Fundamentally, there are several nuances to dealing with large amounts of data spread across many datasets spanning a wide quality range. As a rule of thumb, you can assume that only a fraction of the data you process can actually be used for model training (the high-quality portion), i.e. the yield is low. This yield can be anywhere from 10% to 90% depending on the data source.
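As a quick sketch of what that yield range means for planning, consider how much raw data you would need to source to hit a given target of training-ready hours (the numbers below are illustrative, not our actual figures):

```python
# Illustrative only: raw hours to ingest for a target amount of
# training-ready audio, given a pipeline yield.

def raw_hours_needed(target_clean_hours: float, pipeline_yield: float) -> float:
    """Hours of raw audio to ingest for a given yield (0 < yield <= 1)."""
    return target_clean_hours / pipeline_yield

for y in (0.10, 0.50, 0.90):
    print(f"yield {y:.0%}: need {raw_hours_needed(50_000, y):,.0f} raw hours "
          f"for 50,000 clean hours")
# yield 10%: need 500,000 raw hours for 50,000 clean hours
# yield 50%: need 100,000 raw hours for 50,000 clean hours
# yield 90%: need 55,556 raw hours for 50,000 clean hours
```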
Processing large amounts of data is inefficient if downstream steps require human intervention. We don’t want a system where a human manually triggers the various processing steps and needs to know how the data was processed in order to use it. Therefore, it is essential to fully automate the data pipeline to get the most out of the system. The pipeline needs to do things like:
The above steps are like a series of giant filters. These filters serially reject a lot of data, which is where the pipeline yield we mentioned above comes from. At the end of this pipeline, we expect to arrive at high-quality, well-represented, expressive datasets across multiple languages and domains. This is also the secret sauce with which we enrich existing, open-source, publicly available data.
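As a minimal sketch of the “series of giant filters” idea, here is what a serial filter chain over audio clips might look like. The stage names and thresholds are hypothetical, chosen for illustration rather than taken from our production pipeline:

```python
from dataclasses import dataclass, field

@dataclass
class Clip:
    """A unit of audio plus the metadata the pipeline has attached so far."""
    path: str
    duration_s: float
    snr_db: float            # signal-to-noise ratio estimate
    language: str            # predicted language code
    transcript_conf: float   # ASR confidence, 0..1
    tags: list = field(default_factory=list)

# Each filter returns True to keep the clip. Order matters: cheap checks
# should reject data before expensive, model-based ones run.
FILTERS = [
    ("duration", lambda c: 1.0 <= c.duration_s <= 30.0),
    ("language", lambda c: c.language in {"en", "es", "hi"}),
    ("snr",      lambda c: c.snr_db >= 15.0),
    ("asr_conf", lambda c: c.transcript_conf >= 0.85),
]

def run_pipeline(clips):
    """Apply the filters serially, tracking how much each stage rejects."""
    kept, rejected_at = list(clips), {}
    for name, keep in FILTERS:
        before = len(kept)
        kept = [c for c in kept if keep(c)]
        rejected_at[name] = before - len(kept)
    return kept, rejected_at

# Pipeline yield = len(kept) / len(clips), the quantity tracked per data source.
```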
Just as the series of giant filters helps us overcome the problem of automation, we must tackle the following challenges to process large amounts of data at scale.
Storage and systems: The system design for a large-scale data pipeline needs to be intentional from the outset.
Here is a nice blog post about some of the concepts mentioned above.
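We won’t reproduce that design here, but as one hypothetical illustration of being intentional from the outset: writing an explicit manifest alongside each processed shard means downstream users never need to remember how the data was produced. Every field below is made up for the sake of the example:

```python
import json
from pathlib import Path

# Hypothetical manifest written alongside each processed shard of audio.
# The fields are illustrative; the point is that provenance and quality
# metadata live next to the data, not in someone's head.
manifest = {
    "dataset": "broadcast_es_2024_q1",   # made-up name
    "pipeline_version": "2024.05.1",
    "source": "licensed-broadcast",
    "language": "es",
    "hours": 412.6,
    "filters_applied": ["duration", "language", "snr", "asr_conf"],
    "created_at": "2024-05-14T09:30:00Z",
}

shard_dir = Path("data/processed/broadcast_es_2024_q1/shard_0001")
shard_dir.mkdir(parents=True, exist_ok=True)
(shard_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
```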
Compute: To point our automated pipeline at large amounts of data, we need to be able to run it at scale. This requires a high-throughput pipeline orchestration system operating at two levels: a) orchestrating the functional interaction with large amounts of compute, and b) orchestrating the various stages of the automated pipeline. While doing this, it’s important to right-size the compute for each functional stage.
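As a hedged sketch of what right-sizing compute per stage can look like, here is a toy set of declarative stage specifications. The stage names and resource numbers are ours, and a real system would hand these specs to a proper orchestrator rather than a print loop:

```python
from dataclasses import dataclass

@dataclass
class StageSpec:
    """Declarative resource request for one pipeline stage."""
    name: str
    needs_gpu: bool
    cpus: int
    memory_gb: int
    max_parallel_workers: int

# Illustrative right-sizing: CPU-only workers for cheap filters, GPUs only
# where a model actually runs (e.g. ASR).
STAGES = [
    StageSpec("download_and_checksum", needs_gpu=False, cpus=2,  memory_gb=4,  max_parallel_workers=256),
    StageSpec("loudness_and_snr",      needs_gpu=False, cpus=4,  memory_gb=8,  max_parallel_workers=128),
    StageSpec("asr_transcription",     needs_gpu=True,  cpus=8,  memory_gb=32, max_parallel_workers=32),
    StageSpec("dedup_and_index",       needs_gpu=False, cpus=16, memory_gb=64, max_parallel_workers=8),
]

def submit(stage: StageSpec, work_items: list) -> None:
    """Stand-in for handing a stage to a real orchestrator/scheduler."""
    print(f"{stage.name}: {len(work_items)} items, "
          f"{'GPU' if stage.needs_gpu else 'CPU'} workers x{stage.max_parallel_workers}")

for stage in STAGES:
    submit(stage, work_items=list(range(1000)))  # toy batch
```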
Cost: All this storage, compute and networking costs money. Here’s how we manage that:
Solving the problems of scale is one thing; planning for that scale is the next.
As we move into a future where speech models are capable of capturing and reproducing even the most nuanced aspects of human communication, from hesitations to humor, the quality and quantity of the data used to train these models become crucial. At Papercup, we embrace this challenge head-on, harnessing vast, high-quality data and refining our automated pipelines to deliver human-sounding AI dubbing that’s more than just accurate—it’s expressive. Our approach allows us to seamlessly bridge language barriers while preserving the emotional and cultural nuances that make content truly resonate with audiences worldwide.