In a May 4, 2025 paper, researchers at IIT Bombay introduced a new approach to speech-to-speech translation (S2ST) that not only translates speech into another language but also adapts the speaker’s accent.

This work aligns with growing industry interest in accent adaptation technologies. For example, Sanas, a California-based startup, has built a real-time AI accent modification tool that lets users change their accent without changing their voice. Similarly, Krisp offers AI Accent Conversion technology that neutralizes accents in real time, improving clarity in customer support and business settings.

While Sanas and Krisp focus on accent adaptation alone, the IIT Bombay researchers explore how accent and language translation can be combined in a single model.

“To establish effective communication, one must not only translate the language, but also adapt the accent,” the researchers noted. “Thus, our problem is to model an optimal model which can both translate and change the accent from a source speech to a target speech,” they added.

Scalable and Expressive Cross-Lingual Communication

To do this, they proposed a method based on diffusion models, a type of generative AI best known for image generation (DALL-E 2, which creates realistic images from text prompts, is one example), though their applications extend to other domains, including audio generation.

They implemented a three-step pipeline. First, an automatic speech recognition (ASR) system converts the input speech into text. Then, an AI translation model translates the text into the target language. Finally, a diffusion-based text-to-speech model generates speech in the target language with the target accent.
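For a concrete picture of how such a cascade fits together, the sketch below strings the three steps into a few lines of Python. It is purely illustrative: the model names (openai/whisper-small, Helsinki-NLP/opus-mt-en-hi) are common off-the-shelf stand-ins rather than the models used in the paper, and the diffusion-based TTS step is left as a placeholder, sketched separately further down.

```python
# Illustrative three-step cascade: ASR -> text translation -> diffusion TTS.
# Model choices here are generic stand-ins, not the authors' configuration.
from transformers import pipeline

# Step 1: automatic speech recognition converts the source audio into text.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
source_text = asr("input_english.wav")["text"]

# Step 2: a text-to-text translation model produces the target-language text.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
target_text = translator(source_text)[0]["translation_text"]

# Step 3: a diffusion-based text-to-speech model (the paper builds on GradTTS)
# renders the translated text as speech in the target accent -- see the
# reverse-diffusion sketch below.
```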

The core innovation lies in the third step, where the researchers used a diffusion model for speech synthesis. Instead of creating images, the model generates mel-spectrograms (i.e., visual representations of sound) from the translated text and target accent features, which are then converted into audio. For this, the researchers used GradTTS, a diffusion-based text-to-speech model, as the foundation of their system.
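To give a rough sense of what that third step involves: starting from random noise, the model repeatedly denoises a mel-spectrogram, with the translated text and an accent representation steering each step. The loop below is a simplified, generic reverse-diffusion sketch, not the authors' actual sampler; in particular, GradTTS diffuses toward a text-conditioned prior rather than pure noise, and score_model, text_encoding, and accent_embedding are hypothetical stand-ins.

```python
import torch

def generate_mel(score_model, text_encoding, accent_embedding,
                 n_steps=50, mel_bins=80, n_frames=200,
                 beta_min=0.05, beta_max=20.0):
    """Toy reverse-diffusion loop that turns noise into a mel-spectrogram."""
    x = torch.randn(1, mel_bins, n_frames)    # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt                      # diffusion time runs from 1 down to 0
        beta_t = beta_min + t * (beta_max - beta_min)
        # The score network estimates the gradient of the log-density of the
        # mel-spectrogram, conditioned on the text and the target accent.
        score = score_model(x, t, text_encoding, accent_embedding)
        # One Euler step of the deterministic probability-flow update.
        x = x + 0.5 * beta_t * (x + score) * dt
    return x  # mel-spectrogram; a separate vocoder turns it into a waveform

# Tiny smoke test with a dummy score model (the score of a standard Gaussian):
mel = generate_mel(lambda x, t, txt, acc: -x,
                   text_encoding=None, accent_embedding=None)
```

In a real system, the score network would be trained on paired text, accent, and speech data, and the resulting mel-spectrogram would be passed to a neural vocoder to produce the output waveform.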

They tested their model on English and Hindi, evaluating its ability to generate speech that reflects both the correct translation and target accent. “Experimental results […] validate the effectiveness of our approach, highlighting its potential for scalable and expressive cross-lingual communication,” they said.

The researchers acknowledged several limitations, but they still see this as a promising starting point. “This work sets the stage for further exploration into unified, diffusion-based speech generation frameworks for real-world multilingual applications,” they concluded.

Authors: Abhishek Mishra, Ritesh Sur Chowdhury, Vartul Bahuguna, Isha Pandey, and Ganesh Ramakrishnan


