Machine translation (MT) models often struggle with linguistic diversity, favoring dominant dialects and leaving many language varieties underserved.

In a February 20, 2025 paper, researchers from the University of Porto, INESC TEC, Heidelberg University, University of Beira Interior, and Ci2 – Smart Cities Research Center introduced Tradutor, the first open-source AI translation model specifically tailored for European Portuguese.

Tradutor aims to fill the gap left by many translation models that focus mainly on Brazilian Portuguese, which is used by the majority of Portuguese speakers.

The researchers explained that most MT systems prioritize Brazilian Portuguese, leaving speakers from Portugal and other regions at a disadvantage. This can be particularly problematic in areas like healthcare and legal services, where accurate language use is crucial.

To address this, the researchers developed PTradutor, an extensive parallel corpus comprising over 1.7 million documents in both English and European Portuguese. This dataset spans diverse domains, including journalism, literature, web content, politics, legal documents, and social media, providing a rich linguistic foundation for training. 

“We provide the community with the largest translation dataset for European Portuguese and English,” they said.

The corpus was meticulously curated through a process of collecting monolingual European Portuguese texts, translating them into English with Google Translate — due to its accessibility and relatively high quality — and implementing rigorous quality checks to maintain data integrity.
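The article does not spell out the individual quality checks, but the overall pipeline — collect monolingual European Portuguese text, machine-translate it into English, then filter the resulting pairs — can be sketched roughly as below. This is a minimal illustrative sketch in Python: the `translate_to_english` stub and the specific filters (language identification and a length-ratio bound) are assumptions for illustration, not the authors' exact procedure.

```python
# Illustrative sketch of turning a monolingual corpus into a parallel corpus.
# The translation backend and the filtering rules are assumed, not taken from the paper.

from langdetect import detect  # pip install langdetect


def translate_to_english(text: str) -> str:
    """Placeholder for whichever translation API is used (the paper used Google Translate)."""
    raise NotImplementedError("plug in a translation backend here")


def passes_quality_checks(pt_text: str, en_text: str) -> bool:
    """Assumed sanity checks: non-empty sides, expected languages, plausible length ratio."""
    if not pt_text.strip() or not en_text.strip():
        return False
    if detect(pt_text) != "pt" or detect(en_text) != "en":
        return False
    ratio = len(en_text) / len(pt_text)
    return 0.5 <= ratio <= 2.0


def build_parallel_corpus(pt_documents):
    """Yield (European Portuguese, English) pairs that survive the checks."""
    for pt_text in pt_documents:
        en_text = translate_to_english(pt_text)
        if passes_quality_checks(pt_text, en_text):
            yield pt_text, en_text
```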

Using this dataset, the researchers fine-tuned three open-source large language models (LLMs) — Google’s Gemma-2 2B, Microsoft’s Phi-3 mini, and Meta’s LLaMA-3 8B — to create an AI translation model adept at translating English into European Portuguese. The fine-tuning process involved both full model training and parameter-efficient techniques like Low-Rank Adaptation (LoRA).
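As a rough illustration of what LoRA-based fine-tuning of one of these models can look like in practice, the sketch below uses the Hugging Face transformers and peft libraries; the adapter rank, target modules, and prompt format are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of parameter-efficient (LoRA) fine-tuning for English -> European
# Portuguese translation on a decoder-only LLM. Hyperparameters are assumed values.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3-8B"  # one of the base models used in the paper

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA trains small low-rank adapter matrices instead of all 8B base parameters.
lora_config = LoraConfig(
    r=16,                                  # adapter rank (assumed value)
    lora_alpha=32,                         # scaling factor (assumed value)
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable


def format_example(english: str, portuguese: str) -> str:
    """Pair an English source with its European Portuguese reference as one training
    string; the prompt template here is an assumption, not the paper's format."""
    return f"Translate to European Portuguese:\n{english}\n### Translation:\n{portuguese}"
```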

Significant Accomplishment

Early tests show that Tradutor outperforms many existing open-source systems and comes close to some of the best closed-source industry models.

Specifically, the fine-tuned LLaMA-3 8B model outperformed existing open-source systems and approached industry-standard closed-source models, such as Google Translate and DeepL, in translation quality.

“Our best model surpasses existing open-source translation systems for Portuguese and approaches the performance of industry-leading closed-source systems for European Portuguese,” the researchers highlighted.

They also emphasized that the goal was not necessarily to surpass commercial models but to “propose a computationally efficient, adaptable, and resource-efficient method for adapting small language models to translate specific language varieties.” Achieving results close to industry-leading models marks a “significant accomplishment,” according to the researchers.


While Tradutor was developed as a case study for European Portuguese, the researchers noted that the same methodology could be applied to other languages facing similar challenges.

By open-sourcing the PTradutor dataset, the code to replicate it, and the Tradutor model, they aim to encourage further research and development in language variety-specific MT, promoting greater linguistic inclusivity in AI-powered systems.

“We aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties,” they concluded.

Authors: Hugo Sousa, Satya Almasian, Ricardo Campos, and Alípio Jorge


