SpeechT, the first-ever voluntary mentorship program in speech translation, launched and led by NLP researcher Yasmin Moslem, brought together researchers, practitioners, and students from companies and institutions around the world to explore speech translation.

Running from December 2024 to January 2025, the initiative introduced participants to data collection, model training, and advanced research techniques, helping them develop hands-on expertise in speech translation.

Participants came from varied backgrounds, ranging from software engineering to text-to-text machine translation (MT), and followed a structured three-week mentorship program.

The first week focused on data preparation, where participants collected and processed bilingual speech datasets. In the second week, they trained and fine-tuned models using the datasets prepared earlier. The final week was dedicated to advanced research, where participants explored synthetic data generation, language model post-processing, and domain adaptation.
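As a rough illustration of those first two weeks, the sketch below loads a hypothetical bilingual speech dataset and turns it into model-ready features for fine-tuning a Whisper-style checkpoint with Hugging Face tools; the folder layout, column names, and model name are assumptions for illustration, not the exact setup participants used.

```python
from datasets import load_dataset, Audio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Week 1: a bilingual speech dataset in the "audiofolder" layout, i.e. audio files
# plus a metadata.csv holding an English "translation" column (hypothetical path).
dataset = load_dataset("audiofolder", data_dir="data/speech-translation")["train"]
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))  # resample for Whisper

# Week 2: convert each example into input features and label ids for fine-tuning.
processor = WhisperProcessor.from_pretrained("openai/whisper-small", task="translate")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def prepare(example):
    audio = example["audio"]
    example["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    example["labels"] = processor.tokenizer(example["translation"]).input_ids
    return example

dataset = dataset.map(prepare, remove_columns=dataset.column_names)
# A Seq2SeqTrainer with a padding data collator would then fine-tune `model` on `dataset`.
```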

During this phase, participants experimented with synthetic data augmentation, which is especially valuable when data for a given language or domain is limited. They used text-to-speech (TTS) models to create synthetic source audio for existing translations, applied MT models to produce translated text from transcriptions, and refined data quality by improving alignment between audio and text with segmentation techniques and part-of-speech tagging.
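The sketch below illustrates the two main augmentation routes with off-the-shelf Hugging Face pipelines; the checkpoints, sentences, and file names are placeholders rather than the models the teams actually used.

```python
import numpy as np
import soundfile as sf
from transformers import pipeline

# 1) TTS: create synthetic source audio for an existing target-language translation.
tts = pipeline("text-to-speech", model="facebook/mms-tts-ind")  # Indonesian TTS (illustrative)
pair = {"source": "Selamat pagi, apa kabar?", "target": "Good morning, how are you?"}

speech = tts(pair["source"])
audio = np.squeeze(speech["audio"])  # 1-D waveform
sf.write("synthetic_0001.wav", audio, speech["sampling_rate"])
# -> ("synthetic_0001.wav", pair["target"]) becomes a new speech-translation example.

# 2) MT: produce translated text for audio that only has a transcription.
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-id-en")
transcription = "Terima kasih banyak."
translation = mt(transcription)[0]["translation_text"]
# -> (original audio, translation) becomes another synthetic training pair.
```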

By the end of the task, participants had to share their advanced models and explain how their approach improved speech translation quality compared to the original fine-tuned model, Moslem noted.

Hands-On Practice

Five language-specific projects, covering Galician, Indonesian, Arabic, and Bengali into English as well as Spanish into Japanese, examined the differences between end-to-end (E2E) models, where a single model generates translations directly from speech, and cascaded speech translation systems, which separate automatic speech recognition (ASR) and MT.

While E2E systems offered lower latency and simplified deployment, cascaded models consistently outperformed them in translation accuracy. 

The study highlights that cascaded systems allow for individual optimization of each component and can easily integrate domain-specific MT, making them more adaptable for specialized content.
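The snippet below contrasts the two designs on a single audio file, using publicly available checkpoints as stand-ins; the participants' projects relied on their own language-specific models, so the names here are illustrative only.

```python
from transformers import pipeline

audio_path = "sample.wav"  # hypothetical 16 kHz source-language recording

# End-to-end: a single model maps source speech directly to English text.
e2e = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    generate_kwargs={"task": "translate"},  # Whisper's built-in speech-to-English mode
)
print("E2E:", e2e(audio_path)["text"])

# Cascaded: ASR first, then a separate MT model that can be swapped or
# domain-adapted independently of the acoustic component.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    generate_kwargs={"task": "transcribe"},
)
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-id-en")

transcript = asr(audio_path)["text"]
print("Cascaded:", mt(transcript)[0]["translation_text"])
```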

“This mentorship has enabled the participants to experiment with various system designs and fine-tuning strategies, deepening their understanding of the speech translation area through hands-on practice,” Moslem concluded.

By making its results, datasets, and models available on Hugging Face, the program provides a valuable resource for both academic research and industry applications, contributing to ongoing efforts to improve speech translation.

Authors: Yasmin Moslem, Juan Julián Cea Morán, Mariano Gonzalez-Gomez, Muhammad Hazim Al Farouq, Farah Abdou, and Satarupa Deb
