Researchers from Google and Unbabel have unveiled WMT24++, a major expansion of the WMT24 machine translation (MT) benchmark, extending its language coverage from 9 to 55 languages and dialects.
The dataset now includes human-written reference translations and post-edits for 46 additional languages, as well as new post-edits of the references for 8 of the original 9 WMT24 languages. The benchmark covers four domains: literary, news, social, and speech.
To compile WMT24++, the researchers collected translations from professional linguists who were “fairly compensated for their work for the region in which they live.”
The researchers emphasized the importance of collecting benchmark datasets to evaluate the multilingual performance of large language models (LLMs), particularly in MT.
“As large language models become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation,” they said.
Markus Freitag, Head of Google Translate Research, highlighted on X that WMT24++ is Google’s second major dataset release in two days, following SMOL, a professionally translated dataset for 115 very low-resource languages. “Two new datasets from Google Translate targeting high and low resource languages!” he wrote.
LLMs Outperform Traditional MT Systems
The researchers benchmarked leading MT providers and LLMs on WMT24++ using both reference-based and reference-free automatic metrics, including MetricX-24 and its reference-free counterpart MetricX-24-QE, COMET-based models, and Gemini-based scoring.
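For readers unfamiliar with the distinction, the sketch below shows how a reference-based and a reference-free (quality estimation) score can be computed with the open-source unbabel-comet library. The specific checkpoints and the toy segment are illustrative assumptions, not the exact evaluation setup used in the paper.

```python
# Illustrative sketch only; not the paper's exact evaluation pipeline.
# Requires: pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Reference-based metric: scores the MT output against a human reference.
ref_based = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

# Reference-free (QE) metric: scores the MT output from the source alone.
# (This checkpoint is gated on Hugging Face and requires an access token.)
ref_free = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

segments = [{
    "src": "Der Bericht wurde gestern veröffentlicht.",  # source sentence (toy example)
    "mt":  "The report was published yesterday.",         # system translation
    "ref": "The report was released yesterday.",          # human reference
}]

print("reference-based:", ref_based.predict(segments, batch_size=8, gpus=0).system_score)
print("reference-free: ", ref_free.predict(segments, batch_size=8, gpus=0).system_score)
```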
They found that LLMs outperformed traditional MT systems across all 55 languages. OpenAI’s o1, Google’s Gemini-1.5 Pro, and Anthropic’s Claude 3.5 ranked as the top-performing systems, surpassing conventional MT providers such as Google Translate, DeepL, and Microsoft Translator.
“Frontier LLMs, like OpenAI o1, Gemini-1.5 Pro, and Claude 3.5, are highly capable MT systems in all 55 languages (according to automatic metrics), outperforming standard MT providers,” the researchers noted. They also found minimal performance differences between the top LLMs.
Need for Human Evaluation
Despite LLMs outperforming traditional MT in automatic evaluation, the researchers caution against overestimating their capabilities.
They stress that automatic metrics may undervalue human translations due to inherent biases, and their effectiveness remains largely untested in many of the 55 languages covered by WMT24++.
The researchers acknowledge that human evaluation remains crucial for assessing actual translation quality and understanding LLM limitations, and they plan to conduct a large-scale human evaluation in future work to validate these findings.
“We caution against using our results to immediately conclude that LLMs produce superhuman performance in all languages due to the limitations of automatic metrics, which may be biased against human translations and largely untested in most of the 55 languages,” they said.
In addition to the textual data, WMT24++ also preserves source images where available, providing full-page screenshots that researchers hope will support multimodal translation studies.
The dataset is publicly available on Hugging Face.
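As a minimal sketch, the dataset can be loaded per language pair with the Hugging Face datasets library. The repository id ("google/wmt24pp"), the config name, and the field names below are assumptions based on the public release and may differ.

```python
# Illustrative sketch; repo id, config name, and field names are assumptions.
# Requires: pip install datasets
from datasets import load_dataset

# Each config corresponds to one English-to-target language pair, e.g. "en-de_DE".
wmt24pp = load_dataset("google/wmt24pp", "en-de_DE", split="train")

# Each row pairs an English source segment with its post-edited target reference.
example = wmt24pp[0]
print(example["source"], "->", example["target"])
```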
Authors: Daniel Deutsch, Eleftheria Briakou, Isaac Caswell, Mara Finkelstein, Rebecca Galor, Juraj Juraska, Geza Kovacs, Alison Lui, Ricardo Rei, Jason Riesa, Shruti Rijhwani, Parker Riley, Elizabeth Salesky, Firas Trabelsi, Stephanie Winkler, Biao Zhang, and Markus Freitag