How Amazon Aims to Improve Lip-Sync in AI Dubbing

January 13, 2025


In a December 21, 2024 paper, researchers from Amazon introduced the audio-visual speech-to-speech translation (AVS2S) framework, designed to improve lip-synchrony in AI dubbing without altering the original visual content.

Existing dubbing systems frequently modify visual elements to synchronize with generated audio. This often leads to disjointed experiences, raises ethical concerns, and compromises the integrity of the original video, the researchers explained. 

With their framework, the group aims to overcome these issues by improving speech translation while preserving the original visuals. “Unlike the previous approaches, we explore a research area with two realistic constraints 1) where the original videos are preserved 2) voice characteristics of original speakers are not mimicked,” they said.

The researchers identified another key issue in existing models: they do not utilize lip-synchrony as a constraint in model training.

In their work, they use visual inputs to improve lip-synchrony between the translated speech and the original video without generating new visuals, and, most importantly, they integrate duration and synchronization losses into the training process of AVS2S models.

Duration loss ensures the timing of the translated speech matches the original. A duration predictor estimates how long each word or sound should last when it’s spoken and compares this to the actual timing needed for the original speech. The goal is to minimize any differences, ensuring that the translated speech fits well with the lip movements in the video.
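For illustration, a duration loss of this kind might look like the PyTorch sketch below. The L1 penalty and log-domain durations are assumptions borrowed from common text-to-speech duration predictors, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def duration_loss(predicted_log_durations: torch.Tensor,
                  target_durations: torch.Tensor) -> torch.Tensor:
    """Penalize mismatches between predicted and reference unit durations.

    Sketch only: log-domain targets and an L1 penalty are assumptions
    (common in TTS duration predictors), not the paper's exact loss.
    """
    # Work in log space so very long units do not dominate the loss.
    target_log = torch.log(target_durations.float().clamp(min=1.0))
    return F.l1_loss(predicted_log_durations, target_log)
```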

Synchronization loss measures how well the audio (the translated speech) matches the visual (the lip movements in the video). A special model called SyncNet checks how closely the audio and video align and assigns a score based on how well the audio matches the lip movements. The aim here is to maximize this score, keeping the audio and visuals closely in sync.
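A synchronization loss along these lines could be sketched as follows, assuming the SyncNet-style model emits paired audio and lip-region embeddings for aligned time windows. The cosine-similarity form is an assumption, not the paper's published formula.

```python
import torch
import torch.nn.functional as F

def sync_loss(audio_emb: torch.Tensor, video_emb: torch.Tensor) -> torch.Tensor:
    """SyncNet-style synchronization loss (sketch).

    `audio_emb` and `video_emb` hold embeddings of temporally aligned
    audio and lip-region video windows, shape (num_windows, dim).
    Minimizing the negative cosine similarity maximizes the
    audio-visual agreement score described above.
    """
    sim = F.cosine_similarity(audio_emb, video_emb, dim=-1)  # per window
    return -sim.mean()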

In short, duration loss ensures the timing is right, while synchronization loss ensures the audio matches the lip movements, leading to better dubbing quality overall.
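At training time, the two losses would then be folded into the model's overall objective. The weighted sum below is a plausible sketch; the weights and the exact form of the translation loss are assumptions rather than details from the paper.

```python
import torch

def total_loss(translation_loss: torch.Tensor,
               dur_loss: torch.Tensor,
               s_loss: torch.Tensor,
               lambda_dur: float = 1.0,
               lambda_sync: float = 1.0) -> torch.Tensor:
    """Weighted sum of the three training terms (hypothetical weights)."""
    return translation_loss + lambda_dur * dur_loss + lambda_sync * s_loss
```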

Improved Viewing Experience

The AVS2S framework operates by first encoding the original video, which includes both visuals (lip movements) and audio (spoken dialogue), using a pre-trained audio-visual encoder, and converting the input into discrete audio-visual units.

Next, these audio-visual units are translated from the source language to the target language using an encoder-decoder network. A duration predictor is employed to estimate the duration of each speech unit, and duration loss ensures that the timing of the translated speech aligns properly with the original speech.

After translation, a vocoder generates high-quality speech from the translated audio-visual units. Finally, the translated speech is overlaid onto the original video, with synchronization loss applied to maintain lip-synchrony, ensuring that the audio matches the visual cues effectively.
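Put together, the inference flow reads roughly like the sketch below. Every component name is a hypothetical placeholder for the modules the paper describes, passed in as callables; this is not the authors' implementation.

```python
def dub_video(video_frames, source_audio,
              av_encoder, translator, vocoder, mux):
    """End-to-end AVS2S flow as described above (illustrative sketch)."""
    # 1. Encode lip movements + spoken dialogue into discrete
    #    audio-visual units with a pre-trained audio-visual encoder.
    source_units = av_encoder(video_frames, source_audio)

    # 2. Translate the units to the target language; the duration
    #    predictor's estimates keep the timing close to the original
    #    (enforced by the duration loss during training).
    target_units, durations = translator(source_units)

    # 3. Synthesize target-language speech from the translated units.
    translated_speech = vocoder(target_units, durations)

    # 4. Overlay the new speech on the untouched original video; the
    #    synchronization loss trains the model so that this audio
    #    matches the original lip movements.
    return mux(video_frames, translated_speech)
```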


The researchers evaluated the effectiveness of their framework using various metrics, including lip-synchrony scores and translation quality assessments. They found that their method significantly outperformed baseline models, achieving higher lip-synchrony scores across multiple language pairs, including English to Spanish, Portuguese, Italian, and French.

“Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation,” they noted.

The researchers concluded that their approach could improve the overall viewing experience by ensuring seamless alignment of the generated speech with the original video.

Authors: Lucas Goncalves, Prashant Mathur, Xing Niu, Brady Houston, Chandrashekhar Lavania, Srikanth Vishnubhotla, Lijia Sun, and Anthony Ferritto


