In a February 7, 2025 paper, researchers from Chinese tech company Xiaomi benchmarked the capabilities of open-source large language models (LLMs) with under ten billion parameters for multilingual machine translation (MT) tasks. They proposed the “best data recipe” to enhance AI translation performance.
The researchers explained that open-source LLMs have shown improvements in multilingual capabilities, with even small-scale open-source models like Mistral-7B, Qwen2-7B, LLaMA3-8B, and Gemma2-9B demonstrating gains in translation quality. However, “these models […] still fall short compared to closed-source models,” they noted.
Their evaluation found that Gemma2-9B outperforms all other open-source LLMs, followed by LLaMA3.1/3-8B, then Qwen2/2.5-7B, with Mistral-7B ranking last.
Previous studies have attempted to boost LLM translation capabilities using multilingual corpora — monolingual and parallel datasets — during continual pretraining. However, the best way to combine these datasets remains unclear, according to the researchers.
The Xiaomi team set out to systematically explore the optimal mixing strategy for monolingual and parallel data to achieve the best multilingual MT results.
“Our work differs in that we primarily focus on exploring the optimal mixing strategy of monolingual and parallel data during continual pretraining to achieve the best translation performance,” they said.
Best Data Recipe
To determine the “best data recipe,” the researchers took Gemma2-9B — the top-performing open-source LLM in their evaluation — and tested five different data-mixing configurations, evaluating the impact of varying monolingual-to-parallel data ratios.
They found that relying solely on monolingual data is suboptimal, as it negatively affects translation quality for high-resource languages in xx→en directions. Incorporating parallel data at any volume consistently improves translation quality, particularly for high-resource languages, while low- and mid-resource languages benefit more from monolingual data due to the scarcity of parallel corpora.
Prioritizing parallel data while supplementing it with monolingual data outperforms other approaches across most language pairs.
“Based on our experimental results, we propose a Parallel-First Monolingual-Second (PFMS) data mixing strategy, where we give higher priority to parallel data than monolingual data when preparing the continual pretraining dataset,” they explained.
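The paper does not reproduce its data-preparation code, but the parallel-first idea can be illustrated with a short sketch. The Python below is only a rough illustration of one way such a policy could work; the function, corpus names, token counts, and budget are all assumptions for this example, not the authors' actual recipe.

```python
# Illustrative sketch of a Parallel-First Monolingual-Second (PFMS) mix.
# All names, token counts, and the budget below are assumptions for
# illustration; they are not taken from the paper.

def pfms_mix(parallel_corpora, monolingual_corpora, token_budget):
    """Fill a continual-pretraining budget with parallel data first,
    then top up the remainder with monolingual data."""
    mix, used = [], 0

    # 1) Parallel data gets priority: take as much as fits in the budget.
    for corpus in parallel_corpora:
        take = min(corpus["tokens"], token_budget - used)
        if take <= 0:
            break
        mix.append({"name": corpus["name"], "tokens": take, "kind": "parallel"})
        used += take

    # 2) Monolingual data fills whatever budget remains, which mainly
    #    helps low- and mid-resource languages with scarce parallel data.
    for corpus in monolingual_corpora:
        take = min(corpus["tokens"], token_budget - used)
        if take <= 0:
            break
        mix.append({"name": corpus["name"], "tokens": take, "kind": "monolingual"})
        used += take

    return mix


# Hypothetical usage with made-up corpus sizes (in tokens):
parallel = [{"name": "en-de parallel", "tokens": 3_000_000_000},
            {"name": "en-sw parallel", "tokens": 200_000_000}]
mono = [{"name": "sw monolingual", "tokens": 2_000_000_000},
        {"name": "de monolingual", "tokens": 5_000_000_000}]
print(pfms_mix(parallel, mono, token_budget=6_000_000_000))
```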
Using the PFMS approach, the researchers trained GemmaX2-28-9B, which consistently outperformed state-of-the-art models like TowerInstruct and X-ALMA, and even performed competitively with Google Translate and GPT-4-turbo.
The researchers also tested a smaller variant, GemmaX2-28-2B, to evaluate the impact of scaling. While the 2B model showed strong multilingual translation capabilities relative to its size, the 9B model consistently outperformed it across all benchmarks.
Looking ahead, the Xiaomi researchers “aim to develop models that support a broader range of languages and possess enhanced translation capabilities.”
GemmaX2-28-9B is now available on Hugging Face.
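For readers who want to try the model, a minimal sketch using the Hugging Face Transformers library is shown below. The repository ID and prompt format are assumptions; check the GemmaX2-28-9B model card on Hugging Face for the exact identifier and recommended prompt.

```python
# Minimal translation sketch using Hugging Face Transformers.
# The repository ID and prompt format below are assumptions; consult the
# GemmaX2-28-9B model card on Hugging Face for the exact details.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ModelSpace/GemmaX2-28-9B-v0.1"  # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Assumed instruction-style translation prompt.
prompt = "Translate this from English to German:\nEnglish: Hello, world!\nGerman:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens (the translation).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```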
Authors: Menglong Cui, Pengzhi Gao, Wei Liu, Jian Luan, and Bin Wang