Large Language Models Struggle to Evaluate Long AI Translations, Amazon Finds – slator.com

A new study from Amazon has revealed a limitation in using large language models (LLMs) to evaluate AI translation quality: performance drops as input length increases. While LLMs are increasingly used for high-quality sentence-level AI translation evaluation, the study finds that these models become “less reliable when evaluating long-form translation outputs.” Amazon researchers Tobias Domhan […]
Unbabel Tackles Metric Bias in AI Translation – slator.com

In a paper published on March 11, 2025, Unbabel introduced MINTADJUST, a method for more accurate and reliable machine translation (MT) evaluation. MINTADJUST addresses metric interference (MINT), a phenomenon where using the same or related metrics for both model optimization and evaluation leads to over-optimistic performance estimates. The researchers identified two scenarios where MINT commonly occurs and […]
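To see why metric interference inflates scores, consider a minimal, self-contained sketch (not from the Unbabel paper; all data and metric definitions below are synthetic assumptions): candidates are selected with one metric and then scored either with that same metric or with an independent one. The gain measured by the metric used for selection overstates the gain seen under the independent metric.

```python
# Toy illustration of metric interference (MINT), using synthetic data.
# If candidate translations are selected (optimized) with metric_a, then
# evaluating that selection with metric_a overstates the improvement that
# an independent metric_b (or a human judge) would report.
import random

random.seed(0)

def metric_a(candidate):          # metric reused for optimization (e.g. reranking)
    return candidate["quality"] + candidate["noise_a"]

def metric_b(candidate):          # independent metric used only for evaluation
    return candidate["quality"] + candidate["noise_b"]

# 1,000 source segments, 8 candidate translations each, sharing a latent
# "true" quality plus metric-specific noise.
corpus = [
    [{"quality": random.gauss(0, 1),
      "noise_a": random.gauss(0, 1),
      "noise_b": random.gauss(0, 1)} for _ in range(8)]
    for _ in range(1000)
]

best_by_a = [max(cands, key=metric_a) for cands in corpus]   # optimize on metric_a
baseline  = [cands[0] for cands in corpus]                   # no optimization

def avg(metric, selection):
    return sum(metric(c) for c in selection) / len(selection)

# The gain looks much larger under the metric that was used for selection.
print("gain under metric_a:", avg(metric_a, best_by_a) - avg(metric_a, baseline))
print("gain under metric_b:", avg(metric_b, best_by_a) - avg(metric_b, baseline))
```

In this sketch the gap between the two reported gains is the over-optimism that MINT describes; MINTADJUST is presented by the authors as a way to correct for it, though its actual procedure is not reproduced here.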