The latest WMT24 analysis, published on July 29, 2024, provides a preliminary ranking of general machine translation (MT) systems and large language models (LLMs).

The findings are based on two “top performing” automatic evaluation metrics: MetricX-23-XL, a reference-based metric, and CometKiwi-DA-XL, a quality estimation (reference-free) metric. The researchers explained that they intentionally selected two distinct types of automatic metric to minimize bias and potential problems.
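The practical distinction is that a reference-based metric compares a system’s output against a human reference translation, while a quality estimation metric scores the output from the source text alone. The hypothetical signatures below illustrate this difference; they are not the actual MetricX or CometKiwi APIs:

```python
# Illustrative signatures only; not the real MetricX-23-XL / CometKiwi-DA-XL APIs.

def reference_based_score(source: str, hypothesis: str, reference: str) -> float:
    """Reference-based (e.g., MetricX-23-XL style): scores the system
    translation (hypothesis) against a human reference translation."""
    ...

def quality_estimation_score(source: str, hypothesis: str) -> float:
    """Quality estimation (e.g., CometKiwi-DA-XL style): scores the system
    translation using only the source text, with no reference needed."""
    ...
```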

For the automatic evaluation, each system was scored by both metrics, and the two scores were averaged to produce a final score. This average determines the system’s rank among all evaluated systems and is referred to as AutoRank.
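As a rough illustration, an AutoRank-style aggregation could look like the sketch below. The system names and scores are hypothetical, and the exact normalization used in the WMT24 report may differ:

```python
# Illustrative AutoRank-style aggregation. The data is hypothetical and the
# exact normalization used in the WMT24 report may differ.

def autorank(scores_by_metric: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Min-max-normalize each metric's scores, average them per system,
    and sort so that rank 1 is the best. Assumes every metric has been
    oriented so that higher scores are better."""
    systems = list(next(iter(scores_by_metric.values())))
    normalized = {}
    for metric, scores in scores_by_metric.items():
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero if all scores are equal
        normalized[metric] = {s: (v - lo) / span for s, v in scores.items()}
    averaged = {
        s: sum(normalized[m][s] for m in scores_by_metric) / len(scores_by_metric)
        for s in systems
    }
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores for three systems under the two metrics.
ranking = autorank({
    "MetricX-23-XL": {"System A": 0.82, "System B": 0.74, "System C": 0.69},
    "CometKiwi-DA-XL": {"System A": 0.79, "System B": 0.80, "System C": 0.70},
})
for rank, (system, score) in enumerate(ranking, start=1):
    print(f"AutoRank {rank}: {system} ({score:.3f})")
```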

While the automatic ranking provides preliminary results, the official ranking will be based on human evaluation, which is deemed “superior” and will supersede the automatic results. “The purpose of this report is not to interpret any findings but only provide preliminary results,” the researchers emphasized.

First Large Scale LLM Evaluation

The evaluation included top commercial MT systems and top LLMs such as GPT-4, Claude-3.5-Sonnet, and Gemini-1.5-Pro across 11 language pairs: Czech > Ukrainian, Japanese > Chinese, and English into Chinese, Czech, German, Hindi, Icelandic, Japanese, Russian, Spanish (Latin America), and Ukrainian. The researchers said that “details of all systems are going to be available in the upcoming WMT24 findings.”

The initial results suggest that LLMs outperform traditional online MT systems, even in non-English, lower-resource settings. Additionally, constrained systems, i.e., those trained using only specifically allowed data and models such as Llama 2 or Mistral, can be competitive with online systems and in some cases reach top ranks, as in the English to Czech language pair.

Tom Kocmi, Senior Researcher at Microsoft, noted in a post on X that this is “the first large scale blind evaluation of LLMs multilingual capabilities.”

Promising Results for Unbabel’s Tower

Unbabel’s Tower model ranked first across all language pairs. Andre Martins, Head of Research at Unbabel, described the results as “very promising” in a post on X, and Ricardo Rei, Senior Research Scientist at Unbabel, expressed optimism in another post on X that the upcoming human evaluation will confirm these results.

Although Tower’s result might be biased, since Unbabel’s own COMET-based metric (CometKiwi-DA-XL) was used for evaluation and likely also during training, Tom Kocmi highlighted in a post on X that “MetricX also confirms its superiority.”

Authors: Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondrej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popovic, Mariya Shmatova, Steinþór Steingrímsson, Vilém Zouhar


