In a research paper published on August 3, 2023, Minsu Kim, Jeongsoo Choi, and Yong Man Ro from Korea Advanced Institute of Science & Technology’s (KAIST) Image and Video Systems Lab, along with Dahun Kim from Google DeepMind, presented a novel approach for many-to-many spoken language translation.
The authors noted that multilingual speech synthesis, especially in real time, has traditionally been complex to handle, because many systems required a separate model for each language owing to the inherent complexity of speech audio data.
“While the text is naturally discrete and only covers linguistic content, the speech audio is continuous and conveys various speaker characteristics such as voice, accent, and timbre,” they explained.
To overcome this challenge, they proposed training a single model to learn unified representations of both multilingual speech and text, using a concept called Unit-to-Unit Translation (UTUT).
The primary goal is to enable seamless translation and synthesis across languages, for both speech and text, with a single model. The model learns to translate spoken language from a source language into a target language using speech units as a common input representation.
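To make this concrete, here is a minimal sketch of such a unit-based speech-to-speech pipeline, assuming a unit extractor, a single many-to-many translation model, and a unit-based vocoder. The function names and toy values are illustrative stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of a unit-based speech-to-speech translation pipeline.
# All three components are stand-ins for the real models.

def speech_to_units(waveform):
    """Stand-in for a self-supervised speech encoder plus a clustering step
    that maps continuous audio to a sequence of discrete speech-unit IDs."""
    return [17, 17, 4, 92, 92, 3]  # toy unit sequence

def translate_units(units, src_lang, tgt_lang):
    """Stand-in for the single many-to-many model: the encoder is conditioned
    on the source language token, the decoder on the target language token."""
    return [5, 41, 41, 8]  # toy translated unit sequence

def units_to_speech(units):
    """Stand-in for a unit-based vocoder that synthesizes audio from units."""
    return b"<waveform bytes>"

# One model serves every translation direction; only the language tokens change.
english_units = speech_to_units(b"<English audio>")
spanish_audio = units_to_speech(translate_units(english_units, "<en>", "<es>"))
```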
Speech units are essentially discretized speech features, similar to phonemes, obtained by clustering extracted speech representations from a self-supervised speech model. These speech units bridge the gap between the continuous nature of speech audio and the discrete nature of text, opening up possibilities for textless natural language processing (NLP), as noted by the authors.
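As a rough illustration of how such units are obtained, the sketch below clusters stand-in frame-level features with k-means and replaces each frame with its cluster ID. The feature dimension, cluster count, and the deduplication of repeated units are assumptions for the example, not the paper's exact setup.

```python
# Illustrative derivation of discrete speech units: cluster frame-level
# features from a self-supervised speech model, then use cluster IDs as units.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Random vectors stand in for real self-supervised features (500 frames).
features = rng.normal(size=(500, 768))

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
frame_units = kmeans.predict(features)  # one discrete unit ID per frame

# Collapse consecutive repeats, a common step before unit-to-unit modeling.
units = [int(u) for i, u in enumerate(frame_units)
         if i == 0 or u != frame_units[i - 1]]
print(units[:20])
```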
The training process conditions the model by providing source language and target language tokens to the encoder and decoder, respectively. This conditioning enables the model to comprehend and translate spoken language across many-to-many language pairs. The model gains a comprehensive understanding of how different spoken languages are processed and related to each other.
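A minimal sketch of this conditioning scheme, with hypothetical token names: the source language token is prepended to the encoder input, while the target language token is handed to the decoder as its first token.

```python
# Illustrative input construction for a language-conditioned encoder-decoder.

def build_model_inputs(src_units, src_lang_token, tgt_lang_token, eos="<eos>"):
    encoder_input = [src_lang_token] + src_units + [eos]
    decoder_start = [tgt_lang_token]  # tells the decoder which language to emit
    return encoder_input, decoder_start

enc, dec = build_model_inputs(["u17", "u4", "u92"], "<en>", "<es>")
print(enc)  # ['<en>', 'u17', 'u4', 'u92', '<eos>']
print(dec)  # ['<es>']
```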
“By training the model with UTUT objective and speech units, the model can construct multilingual knowledge for the spoken language and can perform many-to-many language translations with a single model,” said the authors.
Versatile Multilingual Capabilities
To pre-train their model, the authors used two datasets, VoxPopuli and mTEDx, which contain speech-to-speech translation data in multiple languages. They demonstrated that their UTUT approach not only outperformed previous methods, particularly for many-to-many language translation, but also adapted efficiently to “previously unseen language pairs,” where paired data might not exist.
Although the model was trained without text inputs, it can handle text effectively, since UTUT training on speech units yields unified representations of speech and text, according to the authors.
The authors also highlighted the versatility of the proposed approach across multilingual tasks such as speech-to-speech translation (STS), multilingual text-to-speech synthesis (TTS), and text-to-speech translation (TTST). By pre-training the model with the UTUT objective, it becomes capable of handling diverse tasks involving speech and text in multiple languages, even languages that were not paired during training.
“To the best of our knowledge, this is the first work exploring many-to-many language translation in STS and TTST,” they said.
The proposed method also demonstrated an efficiency advantage, requiring only one-tenth of the training data used by a typical bilingual STS model for each language pair. The authors accordingly emphasized that it “can greatly reduce training costs and save memories by using a single UTUT model instead of using multiple bilingual STS models.”