Gender bias in speech translation (ST) systems has long been a concern for researchers and users alike. In a January 10, 2025 paper, researchers from Microsoft's Speech and Language Group presented their approach to addressing speaker gender bias in large-scale ST systems.
The researchers identified a persistent masculine bias in ST systems, even in cases where the speaker’s gender is evident from audio cues. This bias — often inherited from machine translation (MT) models used during training — results in incorrect or offensive translations, particularly for female speakers.
They explained, “gender bias can arise in different linguistic levels, such as lexical, morphological, and syntactic… [and] may result in inaccurate translations, as the system fails to respect the speaker’s gender identity.”
Another significant obstacle is the scarcity of diverse training data containing accurate gender forms.
To address these challenges, the researchers proposed a two-fold solution: leveraging large language models (LLMs) to rectify gender-specific translations in training data and fine-tuning ST models with this corrected dataset.

Gender-Debiased Translations
Specifically, they used GPT-4 to generate both masculine and feminine speaker forms of the translations for a subset of the training data (2 million utterances). This allowed them to create “gender-debiased training targets,” ensuring the outputs aligned with the speaker’s identity. With this enhanced dataset, they fine-tuned the ST models to accurately infer gender directly from audio inputs.
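The article does not reproduce the prompts the team used, but the data-correction step can be sketched roughly as follows. This is a minimal illustration assuming the OpenAI Python client; the prompt wording, the gendered_variant helper, and the example sentence are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch (not the authors' pipeline): asking an LLM for masculine and
# feminine variants of an existing translation target. Prompt wording, helper
# name, and the example sentence are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Rewrite the following {lang} translation so that all words referring to "
    "the speaker use the {gender} form. Change nothing else.\n\n{text}"
)

def gendered_variant(translation: str, lang: str, gender: str) -> str:
    """Request a speaker-gender-specific rewrite of one target sentence."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": PROMPT.format(lang=lang, gender=gender, text=translation),
        }],
    )
    return response.choices[0].message.content.strip()

# Example: build both gender-debiased targets for one utterance.
target = "Estoy cansado de esperar."  # original, masculine-biased Spanish target
masculine = gendered_variant(target, "Spanish", "masculine")
feminine = gendered_variant(target, "Spanish", "feminine")  # e.g. "Estoy cansada de esperar."
```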
To provide users with greater flexibility, the researchers also introduced a “three-mode” training setup (illustrated in the sketch below):
- Masculine Mode — produces translations exclusively in the masculine form.
- Feminine Mode — produces translations exclusively in the feminine form.
- Auto Mode — automatically infers and applies gender-specific translations based on audio cues.
“Our work proposes to adapt the ST model architecture that can generate accurate speaker gender forms from audio inputs in an ‘Auto’ mode or allow the user to choose the desired speaker gender form in a ‘Masculine’ or ‘Feminine’ mode, respecting the diversity of speakers,” they explained.
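The paper adapts the ST model architecture itself; one common way to expose such modes at inference time is a special decoder prefix tag the fine-tuned model learns to condition on. The sketch below assumes that approach, and the tag names and the translate() interface are illustrative assumptions rather than the architecture described in the paper.

```python
# Minimal sketch of mode selection via a decoder prefix tag.
# Tag strings and the st_model.generate() interface are assumptions.
from typing import Literal

Mode = Literal["auto", "masculine", "feminine"]

MODE_TAGS = {
    "auto": "<gender:auto>",       # model infers speaker gender from audio cues
    "masculine": "<gender:masc>",  # force masculine speaker forms
    "feminine": "<gender:fem>",    # force feminine speaker forms
}

def translate(audio: bytes, st_model, mode: Mode = "auto") -> str:
    """Translate speech using the requested speaker-gender mode."""
    prefix = MODE_TAGS[mode]
    # The fine-tuned ST model is assumed to accept a decoder prefix that
    # steers the gender forms in its output.
    return st_model.generate(audio, decoder_prefix=prefix)
```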
The researchers tested their method on English-to-Spanish and English-to-Italian translation tasks, achieving over 90% accuracy in gender-specific translations. This represents a significant improvement compared to existing systems like Meta’s SeamlessM4T and Nvidia’s Canary, they noted.
Looking ahead, the researchers plan to expand their work to include non-binary and transgender speakers. “Future work may involve exploring bias of other types in large-scale ST systems, reducing bias for non-binary speakers, and better fine-tuning approaches,” they concluded.
Authors: Shubham Bansal, Vikas Joshi, Harveen Chadha, Rupeshkumar Mehta, and Jinyu Li