Voice cloning (sometimes called deepfake voices) and synthetic voices fall under the broad category of generative AI. The term is a catch-all for systems that use artificial intelligence to generate brand-new content: anything from images and video to text and voices. Arguably the most famous current example of generative AI is ChatGPT, a free chatbot launched by OpenAI in November 2022 that uses complex machine learning systems to produce written answers to questions posed by its human users.
Voice cloning falls under the generative AI category because it uses AI to generate new speech that sounds exactly like the single voice in the dataset it was trained on. Synthetic voices also fall into this category; however, the way the voices are generated varies dramatically, as we'll explain below.
So now that you know where voice cloning sits in the world of generative AI, let's get into what voice cloning actually is.
How does voice cloning work?
Voice cloning is the process of mimicking a real human’s speech using artificial intelligence. Datasets from a single voice are used to train AI systems to copy the speaking style, gender, age and accent of a specific individual’s voice. These AI systems can then output new dialogue that sounds exactly like the original speaker.
Voice cloning uses AI to identify, draw out and replicate the features of a chosen voice. The AI applies these features to future audio files, making it appear that an individual has actually voiced the output themselves. The process uses deep learning (an advanced form of machine learning) and requires datasets of the original voice to produce convincing results.
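To make that two-stage idea concrete, here is a minimal, illustrative sketch in Python using PyTorch. The class names (SpeakerEncoder, Synthesizer), the dimensions and the random placeholder inputs are our own illustrative assumptions, not Papercup's system or any production model: real voice cloning pipelines use far larger networks trained on hours of speech.

```python
# A toy sketch of the two stages of voice cloning described above.
# Everything here is illustrative: real systems are much larger and are
# trained on substantial single-speaker datasets.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Stage 1: distil a reference recording into a fixed-size 'voice
    fingerprint' (speaker embedding) that captures traits like timbre,
    accent and pacing."""
    def __init__(self, n_mels=80, embed_dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, embed_dim, batch_first=True)

    def forward(self, mel_frames):            # (batch, time, n_mels)
        _, hidden = self.rnn(mel_frames)
        return F.normalize(hidden[-1], dim=-1)  # (batch, embed_dim)

class Synthesizer(nn.Module):
    """Stage 2: generate a mel spectrogram for brand-new text, conditioned
    on the speaker embedding so the output 'sounds like' the reference."""
    def __init__(self, vocab_size=100, embed_dim=256, n_mels=80):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim * 2, n_mels, batch_first=True)

    def forward(self, token_ids, speaker_embedding):
        text = self.text_embed(token_ids)     # (batch, time, embed_dim)
        speaker = speaker_embedding.unsqueeze(1).expand(-1, text.size(1), -1)
        mel, _ = self.decoder(torch.cat([text, speaker], dim=-1))
        return mel                            # mel frames, ready for a vocoder

# Stage 1: extract the target speaker's embedding from reference audio.
reference_mels = torch.randn(1, 200, 80)      # placeholder for real mel features
voice_fingerprint = SpeakerEncoder()(reference_mels)

# Stage 2: synthesize *new* dialogue in that voice from arbitrary text.
new_dialogue_tokens = torch.randint(0, 100, (1, 50))  # placeholder tokenized text
cloned_speech_mel = Synthesizer()(new_dialogue_tokens, voice_fingerprint)
print(cloned_speech_mel.shape)                # (1, 50, 80)
```

In a real pipeline, stage 1 would consume genuine recordings of the target speaker, and the mel frames from stage 2 would be passed through a neural vocoder to produce the final waveform.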
What are the benefits of synthetic voices?
Synthetic voices cost less to produce and are inherently suited to scaling. They don’t require extensive single-speaker datasets, and they don’t come with the ethical headaches around voice rights and current and future usage.
At Papercup, we use synthetic voices to produce realistic dialogue for faster, more affordable AI dubbing. Media companies and content distributors use our technology to reach a global audience in a way that would otherwise be unachievable.
Synthetic voices allow our users greater control over the end product, with a range of voices and styles available. Our team of expert translators adapts the translation and spoken delivery of the dubbed audio so that it is indistinguishable from that of native speakers.
Are voice clones the same as deepfake voices?
Those within the machine learning industry will say voice clones and deepfake voices are different, since voice clones aren’t necessarily designed to deceive. When it comes to audience perception, however, the two are often seen as one and the same.
Voice cloning’s main purpose is to convince an audience that someone has said something they never actually recorded. Unlike synthetic voices, which build up a bank of data from many different sources to create a new, expressive voice that doesn’t sound like any existing individual, voice cloning builds on an individual’s personal brand and identity. It can be hard to separate the technology from its end use, and there’s no denying voice cloning introduces tricky ethical issues around ownership, licensing rights and audience disclosure.