In a March 11, 2025 paper, Unbabel introduced MINTADJUST, a method for more accurate and reliable machine translation (MT) evaluation.
MINTADJUST addresses metric interference (MINT), a phenomenon where using the same or related metrics for both model optimization and evaluation leads to over-optimistic performance estimates.
The researchers identified two scenarios where MINT commonly occurs and distorts evaluation:
- When a metric is used to select high-quality training data, and the same or a similar metric later evaluates the resulting systems.
- When a metric is used to select the best translation from a pool of candidates at inference time, and the same or a similar metric later evaluates the final outputs.
In both cases, the “interfering metric” shows a preference for systems optimized with it, creating a distorted view of actual translation quality and misleading researchers and practitioners.
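The reranking case is the easier of the two to picture. The toy simulation below is not from the paper; the two stand-in metrics and all numbers are invented for illustration. Each candidate translation has a latent "true" quality that both metrics observe with independent noise; the best candidate is selected with "metric A" and the selections are then scored with both metric A and an uninvolved "metric B". Because selection favors candidates on which metric A's noise happens to be high, metric A reports a rosier average than an independent metric does.

```python
# Toy simulation of metric interference (MINT) in quality-aware reranking.
# Everything here is illustrative: each candidate translation has a latent
# "true" quality, and two metrics observe that quality with independent noise.
import random
from statistics import mean

random.seed(0)
N_SOURCES, N_CANDIDATES = 1_000, 8

scores_a, scores_b = [], []
for _ in range(N_SOURCES):
    qualities = [random.gauss(0.80, 0.05) for _ in range(N_CANDIDATES)]
    metric_a = [q + random.gauss(0.0, 0.03) for q in qualities]  # interfering metric
    metric_b = [q + random.gauss(0.0, 0.03) for q in qualities]  # independent metric

    # Rerank: keep the candidate that metric A likes best.
    best = max(range(N_CANDIDATES), key=lambda i: metric_a[i])
    scores_a.append(metric_a[best])
    scores_b.append(metric_b[best])

print(f"metric A on A-selected outputs: {mean(scores_a):.3f}")  # inflated by selection
print(f"metric B on the same outputs:   {mean(scores_b):.3f}")  # closer to true quality
```

The same selection effect arises in the data-filtering scenario, where a metric picks the training examples and later judges the system trained on them.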
Talking to Slator, José Maria Pombal, a Research Scientist at Unbabel, explained that this happens because MT metrics carry their own biases. When a metric is used for optimization, the MT model inherits those biases, so during evaluation it tends to score disproportionately high on that same metric.
Pombal emphasized that “this bias in evaluation erodes the trust users build for both the evaluation metric and the MT model.”
The researchers also noted that this issue has become even more relevant with the rise of large language models (LLMs), which are increasingly used for both translation and evaluation. LLMs can exhibit systematic biases, particularly when evaluating outputs from their own model family.
A common strategy to mitigate MINT is to use different metrics for optimization and evaluation. However, the researchers argue that this approach is “insufficient,” “since most metrics are trained on similar data and architectures and so may share similar behaviours.”
“Performing evaluation with metrics that are spuriously correlated with the interfering metric is potentially just as misleading as using the interfering metric itself, bringing common practices into question,” they noted.
The Groundwork for Better Evaluation
To tackle this challenge, the researchers developed MINTADJUST, a method that learns to predict and correct the scores of an interfering metric using a set of other, less biased metrics.
Specifically, MINTADJUST learns from a dataset of source texts, model-generated translations, and reference translations from systems not affected by MINT. This establishes a baseline (i.e., a bias-free reference point).
When applied to MINT-affected models, MINTADJUST takes the scores from multiple alternative metrics as input and provides a corrected score for the interfering metric — one that better aligns with human judgments.
The researchers highlighted that MINTADJUST isn’t a new evaluation metric. Instead, it adjusts the scores of existing metrics to account for the distortions caused by MINT. This approach maintains the interpretability of familiar metrics while improving their reliability in MINT scenarios.
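As a rough illustration of that adjustment step, the sketch below fits a simple least-squares model that predicts the interfering metric from a set of alternative metrics on MINT-free data, then uses the prediction as the corrected score for a MINT-affected system. The linear model, the synthetic data, and names such as fit_adjuster and adjusted_score are assumptions made for this example, not the paper's exact formulation.

```python
# Minimal sketch of the MINTADJUST idea as described above: on MINT-free data,
# learn to predict the interfering metric from other metrics, then use that
# prediction in place of the (inflated) observed score for MINT-affected systems.
import numpy as np

def fit_adjuster(alt_scores_clean: np.ndarray, interfering_scores_clean: np.ndarray) -> np.ndarray:
    """Least-squares fit mapping alternative-metric scores -> interfering-metric score,
    estimated only on translations from systems not affected by MINT."""
    X = np.hstack([alt_scores_clean, np.ones((len(alt_scores_clean), 1))])  # add bias term
    weights, *_ = np.linalg.lstsq(X, interfering_scores_clean, rcond=None)
    return weights

def adjusted_score(alt_scores: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Corrected interfering-metric score, predicted from the alternative metrics."""
    X = np.hstack([alt_scores, np.ones((len(alt_scores), 1))])
    return X @ weights

# Usage with synthetic numbers: rows are translations, columns are alternative-metric scores.
rng = np.random.default_rng(0)
alt_clean = rng.uniform(0.6, 0.9, size=(500, 3))                 # MINT-free calibration data
interf_clean = alt_clean.mean(axis=1) + rng.normal(0, 0.01, 500)  # observed interfering metric
w = fit_adjuster(alt_clean, interf_clean)

alt_mint = rng.uniform(0.6, 0.9, size=(10, 3))   # outputs of a MINT-affected system
print(adjusted_score(alt_mint, w))               # adjusted scores, on the interfering metric's scale
```

Keeping the output on the interfering metric's own scale is what preserves the interpretability the researchers point to: practitioners keep reading the scores they are used to, only corrected for the distortion.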
On the WMT24 MT shared task test set, they found that MINTADJUST ranked translations and systems more accurately than state-of-the-art metrics across multiple language pairs. The method was especially effective for high-quality systems.
With this work, the researchers “lay the groundwork for better evaluation practices,” offering a solution to one of MT evaluation’s most persistent challenges.
Authors: José Pombal, Nuno M. Guerreiro, Ricardo Rei, André F. T. Martins