Automating translation of emotion across languages: a closer look under the hood

by Team Papercup

June 6, 2024

At Papercup, our mission has always been clear: make the world’s videos watchable in any language. We’re not just aiming for translation; we’re revolutionizing how voices are dubbed, creating synthetic voices so realistic that they’re indistinguishable from human speech.

To deliver premium video content, we’ve implemented a human-in-the-loop system. This approach blends the scalability of AI with the precision of professional translators, who use their deep domain knowledge to ensure that the nuances of the original content are preserved. For instance, translating the specialized language of poker requires specific expertise to capture the true essence of the game in another language.

But to truly scale our mission, we needed a technological breakthrough. Today, we’re excited to introduce the market's first system that can automatically translate, map, and control emotion from one language to another. Let’s dive into how this works.

The basic AI dubbing model overview

Our system processes a source video by first generating a script based on the speakers on screen, then translating it, and finally creating synthetic speech. Professional translators, the human-in-the-loop (HITL) step, then fine-tune this output. Our latest breakthrough goes a step further by enabling the automatic transfer of emotion, something never before achieved in a scalable system.

Typical AI dubbing workflow
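To make the workflow described above concrete, here is a minimal Python sketch of the four stages: script, translation, synthetic speech, and human review. It is an illustration only and not Papercup's implementation. The `Segment` structure, the `dub` function, and the three callables are hypothetical names, and re-synthesizing after a reviewer edit is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One speaker turn in the dubbing script (hypothetical structure)."""
    speaker: str
    source_text: str
    translated_text: str = ""
    audio: bytes = b""


def dub(
    segments: List[Segment],
    translate: Callable[[str], str],
    synthesize: Callable[[str, str], bytes],
    review: Callable[[Segment], str],
) -> List[Segment]:
    """Translate, synthesize, then apply human review to each segment."""
    for seg in segments:
        seg.translated_text = translate(seg.source_text)
        seg.audio = synthesize(seg.translated_text, seg.speaker)
        corrected = review(seg)  # the human-in-the-loop step
        if corrected != seg.translated_text:
            # Assumption: a reviewer edit triggers re-synthesis.
            seg.translated_text = corrected
            seg.audio = synthesize(corrected, seg.speaker)
    return segments


# Toy stand-ins so the sketch runs end to end; real stages would call
# a translation model, a text-to-speech model, and a review tool.
def fake_translate(text: str) -> str:
    return "es:" + text


def fake_synthesize(text: str, speaker: str) -> bytes:
    return f"{speaker}|{text}".encode("utf-8")


def fake_review(seg: Segment) -> str:
    return seg.translated_text.replace("es:", "ES:")


result = dub([Segment("host", "hello")], fake_translate, fake_synthesize, fake_review)
```

Passing the stages in as functions keeps the sketch independent of any particular model or vendor. Note that nothing in this basic pipeline carries emotion from the source audio to the output, which is the gap the next section addresses.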

Automated solutions have historically struggled with accurately mapping prosody and emotion from one language to another. Manual adjustments or voice actors are often needed to drive the emotion in a synthetic voice, making these solutions either unscalable or unsuitable for premium content.

Introducing XLPT (cross-lingual prosody transfer)

You might be wondering, what exactly is prosody? Prosody refers to the patterns of stress and intonation in speech—essentially, the rhythm and melody that convey the true context and emotion behind words.

To learn more about prosody, read “Teaching computers to speak: the prosody problem”.

TL;DR: the way you say something changes its meaning. For example, “That’s interesting” can be said sarcastically or genuinely. Here are some things you can play with to change prosody:

Emphasizing certain words

How quickly/slowly you speak

Adding pauses for dramatic effect

Yelling vs speaking at a normal volume vs whispering

Laughing while talking
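To show how levers like these can be written down explicitly, the sketch below renders the same sentence with two different prosody settings using SSML, the W3C markup for speech synthesis. This is the kind of manual control that automated transfer aims to replace, and it says nothing about how XLPT works internally. `ProsodyControls` and `to_ssml` are hypothetical names chosen for this example, and laughter is left out because SSML has no standard element for it.

```python
from dataclasses import dataclass, field
from typing import Set
from xml.sax.saxutils import escape


@dataclass
class ProsodyControls:
    """Hand-set prosody levers (hypothetical structure for illustration)."""
    emphasized: Set[str] = field(default_factory=set)   # words to stress
    rate: str = "medium"      # SSML rate label, e.g. "slow" or "fast"
    volume: str = "medium"    # SSML volume label, e.g. "soft" or "loud"
    pause_after: Set[str] = field(default_factory=set)  # words followed by a pause
    pause_ms: int = 500       # pause length in milliseconds


def to_ssml(text: str, controls: ProsodyControls) -> str:
    """Wrap text in SSML tags according to the given controls."""
    parts = []
    for word in text.split():
        key = word.strip(".,!?").lower()
        token = escape(word)  # escape &, <, > so the markup stays valid
        if key in controls.emphasized:
            token = f'<emphasis level="strong">{token}</emphasis>'
        parts.append(token)
        if key in controls.pause_after:
            parts.append(f'<break time="{controls.pause_ms}ms"/>')
    body = " ".join(parts)
    return (
        f'<speak><prosody rate="{controls.rate}" volume="{controls.volume}">'
        f"{body}</prosody></speak>"
    )


# Same words, two deliveries.
sincere = to_ssml("That's interesting", ProsodyControls(emphasized={"interesting"}))
sarcastic = to_ssml(
    "That's interesting",
    ProsodyControls(rate="slow", pause_after={"that's"}),
)
```

The words are identical in both outputs and only the markup differs, which is the point: meaning lives partly in the delivery. Setting these values by hand for every line is what makes manual approaches hard to scale.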

Our new XLPT system changes the game

Papercup's XLPT Workflow

XLPT adds three capabilities:

Automatic mapping of expressivity from the source audio, with no need for human intervention
Custom models trained on customer data for enhanced expressivity
Finer control over emotion, intonation, and tone where needed

This leaves us with the first in-production AI voice system that can automatically translate and map emotion from one language to another, capturing the true essence of the content and all the nuances of language at a scale never before seen.

In this video, you can see our advanced AI automatically translating emotion, intonation, and utterances from the source speech into another language without human intervention.

With our XLPT system, we are not just translating words—we are translating emotions, making every piece of content as engaging and impactful as the original. This is the future of AI dubbing, and it’s here now.

If you’d like to see this in action, simply book a demo with one of our consultants.
