In a March 31, 2025 paper, researchers from the AI Lab at Giant Network, a China-based online game development company, introduced DeepDubber, an AI system that aims to deliver “high-quality movie dubbing,” according to the research team.

Giant Network is known for developing massively multiplayer online role-playing games (MMORPGs). In recent years, the company has been investing heavily in AI. Its AI Lab has developed models such as GiantGPT for game dialogue generation and BaiLing-TTS for speech synthesis.

As part of a broader effort to integrate AI into game production, the company recently launched QianYing, an AI model initiative that includes tools for game video generation (YingGame) and video-guided sound generation (YingSound).

DeepDubber builds on this foundation. Unlike traditional models that primarily focus on syncing speech with lip movements or reproducing basic voice styles, DeepDubber takes a more context-aware and adaptive approach.

The system not only aligns speech with visuals but also understands the context of a scene — whether it’s dialogue, narration, or monologue — and adjusts the voice output to reflect the speaker’s age, gender, and emotion.

The researchers noted that aspects such as adapting to different dubbing styles and capturing fine-grained speaker characteristics have received limited attention so far. To address this, DeepDubber uses multimodal large language models (LLMs) and chain-of-thought (CoT) reasoning to better understand and adapt to each scene.

DeepDubber works in two stages. First, it processes the video and subtitles using a multimodal LLM trained to analyze both visual and textual content. It then applies step-by-step reasoning to identify key elements, such as how many people appear on screen, who is speaking, and what kind of voice-over fits the context. 

By thinking through these steps systematically, DeepDubber builds a detailed understanding of the dubbing needs for each scene. 

Once the scene is understood, the second stage begins. Here, DeepDubber generates speech that matches the visuals, the script, and the speaker characteristics identified in the first step.
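
To make the two-stage flow concrete, here is a minimal Python sketch of the idea. The names used (SceneAnalysis, analyze_scene, synthesize_speech) are illustrative placeholders rather than the authors' actual code, and the stubs simply show how a structured stage-one analysis could condition stage-two speech generation.

```python
from dataclasses import dataclass

@dataclass
class SceneAnalysis:
    """Structured output of the stage-one reasoning step (illustrative fields)."""
    num_people_on_screen: int
    scene_type: str        # "dialogue", "narration", or "monologue"
    speaker_age: str       # e.g. "adult"
    speaker_gender: str    # e.g. "female"
    speaker_emotion: str   # e.g. "excited"

def analyze_scene(video_frames: list, subtitles: str) -> SceneAnalysis:
    """Stage 1 (stub): a multimodal LLM reasons step by step over frames and
    subtitles to decide who is speaking and which voice-over style fits."""
    # In the real system this would be a chain-of-thought query to a multimodal
    # LLM; here a fixed example stands in for that call.
    return SceneAnalysis(
        num_people_on_screen=2,
        scene_type="dialogue",
        speaker_age="adult",
        speaker_gender="female",
        speaker_emotion="excited",
    )

def synthesize_speech(script_line: str, analysis: SceneAnalysis) -> bytes:
    """Stage 2 (stub): generate speech conditioned on the script and the
    speaker characteristics identified in stage 1."""
    prompt = (f"[{analysis.scene_type}] {analysis.speaker_age} "
              f"{analysis.speaker_gender}, {analysis.speaker_emotion}: {script_line}")
    return prompt.encode("utf-8")  # placeholder for an audio waveform

if __name__ == "__main__":
    analysis = analyze_scene(video_frames=[], subtitles="We did it!")
    audio = synthesize_speech("We did it!", analysis)
    print(analysis, len(audio))
```

The point of the split is that the speech generator never has to infer scene context on its own; it receives an explicit, structured description produced by the reasoning stage.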

Performance Gains

To support this approach, the researchers created a dedicated dataset with annotated dubbing scenarios. Each entry in the dataset includes structured reasoning stages, such as summaries and conclusions, allowing the model to learn how to make dubbing decisions from examples. 

These structured annotations also improve the model's ability to generalize across different scenarios and speakers.
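
The article does not reproduce the annotation schema, so the snippet below is only a guess at what one annotated entry with reasoning stages might look like; every field name here is an assumption, not the published format.

```python
# Hypothetical shape of one annotated dataset entry; field names are
# assumptions, not the authors' published schema.
example_entry = {
    "video_clip": "clip_0421.mp4",
    "subtitle": "We have to leave before sunrise.",
    "reasoning": {
        "summary": "Two characters face each other; one moves her lips in sync "
                   "with the subtitle, so this is on-screen dialogue.",
        "conclusion": {
            "scene_type": "dialogue",
            "speaker": {"age": "adult", "gender": "female", "emotion": "urgent"},
        },
    },
    "target_audio": "clip_0421_dub.wav",
}

# Fine-tuning a model to emit the "reasoning" block before its final dubbing
# decision is how annotations like these teach step-by-step behavior.
print(example_entry["reasoning"]["conclusion"]["scene_type"])
```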

The researchers evaluated DeepDubber against prior movie dubbing models such as StyleDubber, Speaker2Dub, and HPMDubbing on benchmarks including the V2C-Animation and GRID datasets.

The results showed performance gains. DeepDubber delivered more accurate speech, better speaker similarity, lower word error rates, and improved lip-sync alignment. Even in zero-shot scenarios using previously unseen speakers, it maintained strong performance, demonstrating robust generalization.
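
Word error rate, one of the reported metrics, has a standard definition that is easy to illustrate: the word-level edit distance between the reference transcript and the recognized dubbed speech, divided by the number of reference words. The short sketch below computes it; it is a generic illustration of the metric, not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("we did it together", "we did it"))  # 0.25
```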

“These results demonstrate the effectiveness of our multimodal reasoning stages in enhancing multimodal movie dubbing performance,” the researchers noted.

Authors: Junjie Zheng, Zihao Chen, Chaofan Ding, and Xinhan Di


