A new study by researchers from Google and Imperial College London challenges a core assumption in AI translation evaluation: that a single metric can capture both semantic accuracy and naturalness of translations.

“Single-score summaries do not and cannot give the complete picture of a system’s true performance,” the researchers said.

In the latest WMT general task, they observed that systems with the best automatic scores — based on neural metrics — did not receive the highest scores from human raters. “This and related phenomena motivated us to reexamine translation evaluation practices,” they added.

The researchers argue that translation quality is fundamentally two-dimensional, encompassing both accuracy (also known as fidelity or adequacy) and naturalness (also known as intelligibility or fluency). 

In the paper, they “mathematically prove and empirically demonstrate” that these two goals are subject to an inherent tradeoff: optimizing for one tends to degrade the other, a point echoed in a recent Slator article that found separating accuracy and fluency improves AI translation evaluation.

To support this, they evaluated submissions to the WMT24 shared task, including large language models (LLMs) like GPT-4 and Claude, and AI translation systems such as Unbabel’s Tower70B.

They found that systems with the highest scores on automatic metrics did not always match human preferences. All systems fell below the accuracy-naturalness curve, which represents the best tradeoff a system can achieve between conveying meaning and sounding fluent; those rated highest by human raters came closest to it. Unbabel’s system, for example, scored high on adequacy but lower on fluency, likely because it was optimized too heavily for accuracy.

A Call for Change

The researchers emphasized that their “first and foremost goal” was to “make the community aware of this tradeoff.” While the paper is mainly theoretical, they believe that “it can have important consequences in practice.”

They argue that the community should rethink how AI translation quality is assessed. “We advocate for a change in how translations are evaluated,” they said.

Instead of relying on a single number, they propose evaluating AI translations along an “accuracy-naturalness plane.” This approach would allow developers and users to tailor system behavior to specific use cases — whether legal, technical, or creative content. 
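
To illustrate the idea, here is a minimal Python sketch, not taken from the paper, of what two-dimensional reporting could look like: each system becomes a point on the accuracy-naturalness plane, and a system sits on the empirical tradeoff frontier if no other system beats it on both dimensions at once. All system names and scores below are hypothetical.

```python
# Minimal sketch of accuracy-naturalness reporting (illustrative only).
# All system names and scores below are hypothetical.

systems = {
    "system_a": {"accuracy": 0.92, "naturalness": 0.78},
    "system_b": {"accuracy": 0.85, "naturalness": 0.90},
    "system_c": {"accuracy": 0.80, "naturalness": 0.75},
}

def dominates(x: dict, y: dict) -> bool:
    """True if x is at least as good as y on both axes and strictly better on one."""
    return (
        x["accuracy"] >= y["accuracy"]
        and x["naturalness"] >= y["naturalness"]
        and (x["accuracy"] > y["accuracy"] or x["naturalness"] > y["naturalness"])
    )

# A system is on the empirical frontier if no other system dominates it.
frontier = [
    name
    for name, scores in systems.items()
    if not any(
        dominates(other, scores)
        for other_name, other in systems.items()
        if other_name != name
    )
]

print("Frontier systems:", frontier)  # -> ['system_a', 'system_b']
```

Under this framing, a legal-content deployment might choose the frontier system with the highest accuracy, while a creative-content one might prefer the most natural system, instead of both being forced to accept whichever system maximizes a single blended score.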

“We suggest to expand future evaluations to explicitly consider again a distinction between accuracy and fluency (or similar dimensions) when evaluating MT systems,” they concluded.

Authors: Gergely Flamich, David Vilar, Jan-Thorsten Peter, and Markus Freitag