Voice generation is the process of using computers to create synthetic voices from different types of input data. AI voice generation products now take many forms: from fully automated text-to-speech tools, with or without human quality checks, to end-to-end speech-to-speech services with expert human quality control. On top of that, different companies package their products differently to suit the needs of their target industry.
For the uninitiated, this fast-evolving field can be hard to navigate. To help companies better understand the industry and find a provider that best suits their needs, here are 7 of the best speech generation products making waves in the space.
To help you navigate this blog, we’ve split it into sections:
1. End-to-end video dubbing services (those designed specifically for video content, which take care of transcription and dubbing and include post-production services like video editing and subtitles)
2. Fully automated voice generation services (those that are ‘do it yourself’, tend to cover a variety of use cases, accept a range of input data, and produce audio, transcripts and fully dubbed video. They tend not to include a human quality check as part of the package because this is done by the user.)
In one line: End-to-end AI dubbing with a library of over 100 highly realistic AI voices and an expert translation team for quality assurance.
Papercup’s AI dubbing technology localizes videos for media companies and creators who want to unlock their content’s global potential. It uses a library of lifelike AI voices to dub into many of the world’s most-spoken languages.
Its end-to-end dubbing solution covers everything from transcription and dubbing to post-production video editing. The AI transcribes the video into a script, translates it into the target language, and generates the dubbed audio using a library of lifelike voices with all the warmth and intonation of real speech. Expert translators ensure 99% accuracy by checking every word of the script and adjusting the speaker style, tone and pronunciation of the translated audio.
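For readers who like to think in code, here’s a minimal sketch of the shape of an end-to-end dubbing pipeline like this: transcribe, translate, run a human quality gate, then synthesize. Every function below is a hypothetical placeholder for illustration, not Papercup’s actual API.

```python
from dataclasses import dataclass

# Hypothetical sketch of an end-to-end dubbing pipeline: transcription,
# translation, human review, then speech synthesis. These functions are
# placeholders for illustration only; they are not a real vendor API.

@dataclass
class Segment:
    start: float               # seconds into the video
    end: float
    source_text: str           # transcribed line in the original language
    translated_text: str = ""

def transcribe(video_path: str) -> list[Segment]:
    """Stand-in for automatic speech recognition on the video's audio track."""
    return [Segment(0.0, 3.2, "Welcome to today's show.")]

def translate(segments: list[Segment], target_lang: str) -> list[Segment]:
    """Stand-in for machine translation of each transcribed segment."""
    for seg in segments:
        seg.translated_text = f"[{target_lang}] {seg.source_text}"
    return segments

def human_review(segments: list[Segment]) -> list[Segment]:
    """Stand-in for the expert-translator pass that checks every line."""
    return segments

def synthesize(segments: list[Segment], voice_id: str) -> bytes:
    """Stand-in for expressive text-to-speech in the target language."""
    return b"\x00" * 1024      # dummy audio payload

if __name__ == "__main__":
    segments = transcribe("episode_01.mp4")
    segments = translate(segments, target_lang="es")
    segments = human_review(segments)   # quality gate before synthesis
    dubbed_audio = synthesize(segments, voice_id="warm_narrator_01")
    print(f"Generated {len(dubbed_audio)} bytes of dubbed audio")
```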
Turnaround is a fraction of what traditional dubbing takes, there's no need to manually stitch together lots of different voice generation solutions, and the results are trusted by media stalwarts like Insider, Sky News, and Jamie Oliver.
In one line: Expressive end-to-end AI dubbing for film and TV, plus a self-service speech-to-speech offering called Deepdub GO for content creators and gaming studios.
Deepdub’s voice cloning technology and end-to-end service were designed for Hollywood-grade entertainment. The AI-driven platform ‘incorporates humans at various stages’ of the dubbing process on a needs basis, for example to add idioms, so that the system can keep learning.
In 2023, it announced the launch of Deepdub GO, a self-service offering aimed at indie game studios, advertising agencies, and content creators. It’s currently in early access. Users can opt for voice cloning or make use of Deepdub’s own platform-generated voices to translate content into other languages.
Deepdub’s end-to-end service accepts a variety of source materials, including video; the AI system then creates the translated audio, and the service includes post-production adaptation and mixing.
Using the Deepdub GO app, creators enter their input language and desired target languages, then upload their video content. Deepdub automatically transcribes and translates the video content into the target language. Creators can also use their own voice as a guide to prompt the software’s vocal emotional expression.
Deepdub GO is an automated service, so users need to assign multiple speakers, select a voiceover option, and adjust tone themselves. They also have to edit their own transcripts and audio, if required, which could be tricky if they don’t speak the output language.
In one line: End-to-end and automated speech-to-speech voice cloning technology for filmmakers, game developers and content creators.
Respeecher offers voice cloning software for filmmakers, game developers, and content creators that replicates voices in the original language. It also offers translations of the target voice, but is working on improving accent control as some of its translated voices ‘may have US English accents.’
Users give the company a high-quality recording of the target voice (the voice they want to clone), alongside a recording of a source voice (someone reading the content they would like the cloned voice to narrate). AI technology then creates a voice-swapping model, and any audio files in the source voice can be converted into the target voice.
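As a rough mental model of that workflow (a sketch for illustration, not Respeecher’s actual SDK), the voice swap boils down to training a conversion model from the paired recordings and then running any new source-voice audio through it:

```python
from typing import Callable

# Illustrative sketch of a voice-conversion workflow. The function below is a
# hypothetical placeholder, not a real voice-cloning SDK.

def train_conversion_model(
    target_voice_wavs: list[str],   # high-quality recordings of the voice to clone
    source_voice_wavs: list[str],   # the same content read by the source speaker
) -> Callable[[bytes], bytes]:
    """Stand-in for training a voice-swapping model; returns a converter."""
    def convert(source_audio: bytes) -> bytes:
        # A real model would re-render this audio in the target voice.
        return source_audio
    return convert

converter = train_conversion_model(
    target_voice_wavs=["target_voice_recording.wav"],
    source_voice_wavs=["source_voice_reading_script.wav"],
)

# Any new line recorded in the source voice can now be swapped into the
# cloned target voice without re-recording with the original speaker.
new_line = b"...raw audio of a new line, read in the source voice..."
cloned_audio = converter(new_line)
```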
We’ve put this in the end-to-end group, but it’s somewhere in between automated and managed. Respeecher started out focusing predominantly on cloning high-profile voices for film – like James Earl Jones as Darth Vader and Richard Nixon in In Event of Moon Disaster. For highly expressive, high-profile content like this, it’s safe to assume that Respeecher works with creators on the post-production. More generally, it markets itself as automated, so users upload the required speech and cloned voices are returned as audio files.
Voice cloning technology lets creators make edits at any stage of production, without having to re-record with key parties.
In one line: ‘Do it yourself’ dubbing for YouTube videos that requires users to edit their transcripts and translations.
Aloud originated in Area 120, Google's in-house incubator. In June 2023, YouTube announced a direct integration with Aloud’s technology that allows creators to access automated dubbing and translation for YouTube videos.
The service automatically transcribes and translates video content to produce dubbed audiovisual output. However, there’s no built-in quality assurance, so users have to review their own transcripts and translations for accuracy if they want high-quality results. This requires users to understand the output language.
Aloud is in early access, but YouTube is currently testing it with hundreds of creators. Its public offering can only dub videos from English into Spanish or Portuguese, with more languages tipped to be added soon. However, the free-to-use service gains points for its fast turnaround and an intuitive interface that lets users pull in YouTube videos directly from a URL.
In one line: Fully automatic bi-directional (from English to other languages and from other languages to English) speech-to-speech dubbing.
Apptek offers automatic speech-to-speech dubbing (video in, audio out) and subtitling and captioning services. Its speech-to-speech dubbing technology merges the newly translated audio with the original video.
Although the company is focused on advances in automatic dubbing, it admits post-editing is often recommended to improve content accuracy and vocal expressivity. For this, it requires a correct transcript in the source language, and often a professionally corrected translation in the target language as well.
Apptek’s dubbing is designed to function entirely automatically. Users input a video, and the solution captures the text from the spoken audio and separates it by speaker to create a usable script on which to base the translation. Using machine learning, speaker characteristics and vocal timing are noted and accounted for before the translated audio is created.
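One detail that makes fully automatic dubbing hard is timing: each translated line has to fit back into the duration of the original utterance. The toy sketch below illustrates that constraint; the helper function and numbers are made up for illustration and are not Apptek's implementation.

```python
# Toy illustration of the timing constraint in automatic dubbing: the
# synthesized translation has to fit the original utterance's duration.
# The rate heuristic and segment data are made up for illustration.

def rate_to_fit(translated_text: str, original_duration_s: float,
                chars_per_second: float = 15.0) -> float:
    """Return a speaking-rate multiplier so the line fits its original slot."""
    natural_duration_s = len(translated_text) / chars_per_second
    return natural_duration_s / original_duration_s  # >1.0 means speak faster

segments = [
    {"speaker": "SPK_1", "duration_s": 2.4, "translation": "Bienvenidos al programa de hoy."},
    {"speaker": "SPK_2", "duration_s": 1.8, "translation": "Gracias por invitarme."},
]

for seg in segments:
    rate = rate_to_fit(seg["translation"], seg["duration_s"])
    print(f'{seg["speaker"]}: adjust speaking rate by x{rate:.2f} to fit {seg["duration_s"]}s')
```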
Apptek has stated it wants to improve the emotion and prosody of its current synthetic voices, which is why it currently markets itself to news and media, enterprise, and e-learning use cases rather than Hollywood.
In one line: Text-to-speech generative AI that provides expressive audio output in a variety of languages.
Founded in 2022, ElevenLabs is a relatively new provider that offers AI text-to-speech voice cloning, pre-made voices and voice design (which lets users create their own voices) for audiobooks and gaming companies.
The solution is able to capture human intonation and inflections for high-quality results. However, while it supports speech generation in multiple languages, it was originally created for monolingual (English-to-English) speech generation, so, as its website states, output languages may have the accent of the original.
ElevenLabs has plans to launch an AI dubbing tool later this year; however, its current focus is primarily on audio content.
Users have to input a text file or transcript to generate spoken output, unless they’re opting to clone their voice, in which case a 30-minute audio sample of the voice is required. ElevenLabs’ output consists of an audio file only, which creators or businesses then have to edit themselves if they want to tweak tone, expression, and intonation.
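For developers who want to try it, a request looks roughly like the sketch below, based on ElevenLabs’ publicly documented v1 text-to-speech REST endpoint at the time of writing. The API key and voice ID are placeholders, and the exact fields may change, so check the current API documentation.

```python
import requests  # third-party HTTP client: pip install requests

# Sketch of a text-to-speech request against ElevenLabs' documented v1 REST
# endpoint at the time of writing. API_KEY and VOICE_ID are placeholders;
# check the current API docs for exact fields and options.

API_KEY = "your-api-key"        # placeholder
VOICE_ID = "your-voice-id"      # placeholder: a pre-made or cloned voice

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={"text": "Hello from a synthetic voice."},
    timeout=60,
)
response.raise_for_status()

# The response body is raw audio (MP3 by default). Any tweaks to tone,
# expression or intonation have to be made on this file afterwards.
with open("output.mp3", "wb") as f:
    f.write(response.content)
```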
In one line: Self-service video dubbing service that supports lots of languages and requires user quality assurance.
Dubverse’s video dubbing platform lets users upload videos and translate the audio into 30+ languages. It’s free to start and comes with over 200 speaker options. The service is fast and gives users access to lots of languages, although the number of languages and voices on offer suggests that these may be generic third-party voices, rather than custom-made ones, which tend to have higher expressivity.
Users have to edit their video transcription for errors themselves, although Dubverse offers trained language reviewers on a needs basis who can assess the quality of the written translation for their customers.
Navigating the world of AI speech generation can be tricky, but making an informed decision can be the key to a winning localization strategy. Creators should consider their own levels of expertise and free time when weighing up the benefits of plug-and-play self-service products against end-to-end localization solutions.
For a fully end-to-end service that includes human quality assurance and supports many of the world’s most-spoken languages with high levels of expressivity, get in touch. Book a demo with a Papercup AI dubbing expert to learn more.