What is voice cloning?

Voice cloning (also known as deepfake voices) and synthetic voices both fall under the broad category of generative AI. The term is a catch-all for systems that use artificial intelligence to generate brand new content: anything from images and video to text and voices. Arguably the most famous current example of generative AI is ChatGPT, a free chatbot launched by OpenAI in November 2022 that uses complex machine learning systems to produce written answers to questions posed by its human users. Voice cloning falls into the generative AI category because it uses AI to generate new speech that sounds exactly like the dataset (from a single voice) it was trained on. Synthetic voices also fall into this category, although the way those voices are generated varies dramatically, as we'll explain below. So now you know where voice cloning sits in the world of generative AI, let's get into what voice cloning actually is.

How does voice cloning work?

Voice cloning is the process of mimicking a real human’s speech using artificial intelligence. Datasets from a single voice are used to train AI systems to copy the speaking style, gender, age and accent of a specific individual’s voice. These AI systems can then output new dialogue that sounds exactly like the original speaker.
Voice cloning uses AI to identify, draw out and replicate the features of a chosen voice. The AI applies these features to future audio files, making it appear that the individual has actually voiced the output themselves. The process uses deep learning (an advanced form of machine learning) and requires datasets of the original voice to produce convincing results.
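To make that two-stage idea concrete, here is a minimal Python sketch of a typical cloning pipeline: first extract a compact "voiceprint" (speaker embedding) from recordings of the target speaker, then condition a text-to-speech model on it. All names here are hypothetical placeholders for illustration only, not a real library or any specific vendor's system.

```python
# Conceptual sketch of a voice-cloning pipeline.
# All function and class names are hypothetical, used purely to illustrate the flow.

from dataclasses import dataclass
from typing import List


@dataclass
class VoicePrint:
    """Features extracted from the single-speaker dataset: pitch, timbre, accent, pacing."""
    embedding: List[float]


def extract_voiceprint(reference_clips: List[str]) -> VoicePrint:
    """Placeholder for a speaker-encoder network trained on recordings of one person."""
    # A real system would run each clip through a neural speaker encoder
    # and average the resulting embeddings into one voiceprint.
    return VoicePrint(embedding=[0.0] * 256)


def synthesize(text: str, voice: VoicePrint, out_path: str) -> None:
    """Placeholder for a text-to-speech model conditioned on the speaker embedding."""
    # A real system would generate audio conditioned on voice.embedding
    # so the new dialogue sounds like the original speaker.
    print(f"Would write cloned speech for {text!r} to {out_path}")


# Usage: a handful of clean recordings of the target speaker in,
# new dialogue in that speaker's voice out.
voice = extract_voiceprint(["clip_01.wav", "clip_02.wav", "clip_03.wav"])
synthesize("New dialogue the speaker never actually recorded.", voice, "cloned_line.wav")
```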

What is voice cloning used for?

Voice cloning is used when audio output needs to match an individual speaker. The voice, produced from a single speaker’s data “thumbprint”, can be used in the following ways:
  1. To preserve the recognizable voice of a particular actor when dubbing film or TV audio. For example, in the case of James Earl Jones, the iconic voice of Darth Vader.
  2. To train AI on datasets of voice actors and public figures who are unavailable, or even deceased, and therefore unable to contribute to a project.
Individuals looking to grow their personal brands, such as influencers or YouTubers, might also turn to voice cloning to lend their (recognizable) voice to more projects, despite busy schedules. They may also want to clone their voice in other languages to reach new global audiences by dubbing existing content. However, our research shows that audiences in this case are more concerned with the accuracy of the translation and how realistic the voices sound than hearing an exact clone of the original voice.

Read our AI dubbing myths report for more on audience expectations

  3. To let localization companies, like Papercup, clone the voice of an actor and use that voice to dub existing content. The cloned voice is not used to replicate the original speaker, but as a new dubbed voice.

What are the alternatives to voice cloning?

Voice cloning is just one way to produce human-sounding speech using AI. It’s possible to create natural-sounding speech using a library of synthetic voices. These synthetic voices are created from many datasets, as opposed to speech data from one existing individual, and still sound entirely human. At Papercup, we use multiple datasets that are optimized for video content — for instance from premium content studios or voice actors — to create a library of highly expressive AI voices that are best for engaging audiences. Using a library of voices enables AI dubbing at a scale not previously possible. These voices are also subject to fewer of the ethical questions around consent, trust and transparency that come with voice cloning.

How are synthetic voices created?

Papercup’s machine learning system uses AI to produce a library of human-sounding voices. These voices capture the nuances and expressivity of the original speech without outright cloning the material. Native speakers often can’t tell the difference between Papercup’s AI voices and human ones. Papercup technology has three engines: one that creates a script from video and translates it, one that generates dubbed audio in new languages using our expressive AI voices, and our human-in-the-loop process in which real translators check and adjust the tone, pronunciation and delivery of the dubbed audio.
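Papercup’s actual systems are proprietary, but purely to illustrate the three-engine flow described above, here is a minimal Python sketch of the pipeline shape: translated script, synthesized audio, human review. The function names, file names and voice IDs are hypothetical assumptions, not Papercup’s real API.

```python
# Illustrative sketch of the three-stage dubbing flow described above.
# All names are hypothetical; this is not Papercup's actual implementation.


def transcribe_and_translate(video_path: str, target_language: str) -> list[str]:
    """Engine 1: produce a translated script from the source video's audio."""
    # A real system would run speech recognition, then machine translation.
    return ["Translated line 1.", "Translated line 2."]


def generate_dubbed_audio(script: list[str], voice_id: str) -> list[str]:
    """Engine 2: synthesize each line with an expressive AI voice from the library."""
    # A real system would return audio clips; placeholder file names stand in here.
    return [f"line_{i}_{voice_id}.wav" for i, _ in enumerate(script)]


def human_review(script: list[str], clips: list[str]) -> list[str]:
    """Engine 3: human-in-the-loop check of tone, pronunciation and delivery."""
    # Translators would flag lines for re-synthesis or adjust delivery here.
    return clips


script = transcribe_and_translate("episode_01.mp4", target_language="es")
clips = generate_dubbed_audio(script, voice_id="expressive_voice_07")
final_clips = human_review(script, clips)
print(final_clips)
```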

When are the different styles of speech technology used?

Voice cloning comes with ethical considerations, namely around consent and usage, which makes it more challenging than using AI voices that pull from multiple datasets to create lifelike expression. In practice, cloning a voice is usually either an individual’s preference or a one-off expense driven by unforeseen circumstances, and it can be both costly and time-intensive. Synthetic voices have a wider range of applications. They make it possible to localize and dub videos at a scale unachievable through traditional dubbing methods while retaining audience trust. Uses include:
  • Dubbing factual video content such as sports, news, and documentaries into new languages at scale
  • Increasing digital media reach in new international markets
Voice cloning is best suited to projects that require a specific individual’s identifiable speech. For other use cases, AI dubbing is faster, simpler, and better suited to audience expectations.

What are the benefits of synthetic voices?

Synthetic voices cost less to produce and are inherently suited to scaling. They don’t require extensive single-speaker datasets, nor do they come with the ethical headaches around voice rights and current and future usage. At Papercup, we use synthetic voices to produce realistic dialogue for faster, more affordable AI dubbing. Media companies and content distributors use our technology to reach a global audience in a way that would otherwise be unachievable. Synthetic voices also give our users greater control over the end product, with a range of voices and styles available. Our team of expert translators adapts the translation and spoken delivery of the dubbed audio so that it is indistinguishable from a native speaker’s delivery.

Are voice clones the same as deepfake voices?

Those within the machine learning industry will say voice clones and deepfake voices are different, since voice clones aren’t necessarily designed to deceive. When it comes to audience perception, however, the two are often seen as the same thing: a deepfake voice’s main purpose is to convince an audience that someone has said something they haven’t. Unlike synthetic voices, which build up a bank of data from many different sources to create a new, expressive voice that doesn’t sound like any existing individual, voice cloning builds on an individual’s personal brand and identity. It can be hard to separate the technology from its end use, and there’s no denying that voice cloning introduces tricky ethical issues around ownership, licensing rights and audience disclosure.

Is voice cloning ethical?

Voice cloning is not outright unethical, but serious consideration has to go into safeguarding. Does the original speaker retain rights over what their voice says? What happens in cases where the speaker can’t consent, such as after their death? Even if a voice actor has signed away their rights, does this change if the voice is used to espouse controversial views? Should limited use cases be agreed ahead of time? There are further considerations when it comes to audiences. Does it infringe consumer rights if audiences don’t know they’re listening to a deepfake voice? Will it erode trust in a business if individuals feel duped? Examples such as the cloning of Anthony Bourdain’s voice after his death for use in a Netflix documentary show how thorny the issue of consumer trust can be. It’s important to consider the issues of consent, data security, and audience trust when assessing the impact voice cloning could have on your brand or business.
