In an October 1, 2024 paper, researchers from Google identified a key challenge in evaluating large language models (LLMs) for machine translation (MT): verbosity.
The term verbosity refers to instances where LLMs explain the reasoning behind their translation choices, provide multiple alternative translations, or even refuse to translate certain content.
The researchers explained that, unlike traditional MT systems — which are explicitly trained and optimized to produce a single translation for a given source text — LLMs tend to take a “more conversational approach.” This conversational behavior strains traditional evaluation frameworks, which assume a single, structured output for each input.
After analyzing several LLMs, the researchers found that verbosity is widespread, but its degree varies across models. OpenAI’s GPT-4 and Cohere’s Aya23 were the least verbose, whereas Google’s Gemini-1.5-Pro emerged as the most verbose LLM, often providing commentary or alternative translations. Mistral AI’s Mistral-Large and Anthropic’s Claude-3.5 exhibited moderate verbosity, while Meta’s LLaMA-3-70B, Cohere’s CommandR+, and Microsoft’s Phi-3-Medium showed low levels of verbosity.
The most common form of verbosity observed was LLMs refusing to translate certain content. For instance, Claude-3.5 frequently refused to translate, while Gemini-1.5-Pro and Mistral-Large exhibited a more balanced mix of refusal to translate and commentary, leaning slightly towards the latter.
Triggers for Verbosity
According to the researchers, LLMs typically refuse to translate when they encounter potentially harmful or copyrighted content, or when faced with non-natural language input like URLs or code snippets.
These triggers are prioritized differently across LLMs. Claude-3.5, for instance, is particularly sensitive to safety and copyright concerns, while Gemini-1.5-Pro and Mistral-Large primarily refuse to translate non-linguistic content. Additionally, some LLMs, such as Phi-3-Medium and Aya23, return empty outputs instead of verbose explanations when they refuse to translate.
Beyond refusals, LLMs can produce verbose outputs that contextualize their translation choices, providing alternative options or additional commentary. This behavior is particularly prominent in Gemini-1.5-Pro and Mistral-Large, though it is notably absent in GPT-4 and Aya23. The researchers pointed out that “short input segments lacking sufficient context are the primary reason for this verbose behavior.”
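As a rough illustration of how such behaviors might be detected in practice, the sketch below buckets raw LLM outputs into the categories the researchers describe. The refusal markers and regex heuristics are illustrative assumptions, not rules taken from the paper:

```python
import re

# Illustrative refusal markers; the paper does not publish an exact list.
REFUSAL_MARKERS = (
    "i cannot translate",
    "i can't translate",
    "i'm unable to translate",
    "i will not translate",
)

def classify_output(output: str) -> str:
    """Bucket a raw LLM translation output into rough verbosity
    categories: 'empty', 'refusal', 'commentary', or 'clean'."""
    text = output.strip()
    if not text:
        return "empty"  # e.g., models that return nothing instead of refusing verbosely
    lowered = text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return "refusal"
    # Numbered alternatives or explicit labels suggest commentary around the translation.
    if re.search(r"^\s*\d+[.)]\s", text, re.MULTILINE) or "translation:" in lowered:
        return "commentary"
    return "clean"
```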
Evaluation Challenges
One major concern raised by the researchers is that existing automatic and human evaluation frameworks do not account for verbose behaviors, often penalizing models that exhibit verbosity.
This can distort LLM performance rankings. In their analysis, models like Gemini-1.5-Pro and Claude-3.5 ranked lower when their verbose outputs were included but performed much better when verbosity was excluded from the evaluation.
“This discrepancy highlights that current […] metrics do not adequately account for the nuanced outputs, leading to potentially misleading rankings,” the researchers noted.
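To see why surface-level metrics penalize verbosity, consider a toy example using the open-source sacrebleu library: a correct translation wrapped in commentary scores far lower on chrF than the bare translation, even though the translation itself is unchanged. The sentences are invented for illustration and are not data from the paper:

```python
import sacrebleu  # pip install sacrebleu

reference = ["The weather is nice today."]
clean = "The weather is nice today."
verbose = (
    "Here is the translation: The weather is nice today. "
    "Alternatively: Today the weather is pleasant."
)

# The verbose hypothesis scores much lower, although it contains
# the exact same correct translation as the clean one.
print(sacrebleu.sentence_chrf(clean, reference).score)    # ~100
print(sacrebleu.sentence_chrf(verbose, reference).score)  # substantially lower
```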
Context-Aware Evaluation
There are two possible ways to address this issue: either modify LLM outputs to fit standardized evaluation metrics, or update evaluation frameworks to better accommodate the varied responses of LLMs.
For example, verbosity could be minimized via prompts, or the output structure could be adjusted to separate commentary from core translations.
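A minimal sketch of the second adjustment — structuring the output so commentary can be separated from the core translation — might look as follows. The prompt wording and JSON field names are hypothetical, not a format proposed in the paper:

```python
import json

# Hypothetical prompt template asking the model to keep commentary
# out of the translation field; the JSON schema is an assumption.
PROMPT = (
    "Translate the following text into German. Respond with JSON "
    'containing a "translation" field and an optional "notes" field '
    "for any commentary.\n\nText: {source}"
)

def extract_translation(raw_response: str) -> str:
    """Return the core translation from a structured response, falling
    back to the raw text when the model ignores the requested format."""
    try:
        return json.loads(raw_response)["translation"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return raw_response.strip()
```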
However, these methods do not entirely solve the problem. As the researchers pointed out, “they may not account for all verbosity-induced errors, especially refusal” and “they make no attempt to reward useful verbosity.”
The researchers argue that context-aware evaluations are necessary to accurately assess the quality of verbose outputs. Specifically, handling cases where LLMs refuse to translate poses the greatest challenge, and they recommend that future evaluation protocols and datasets account for these behaviors.
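One way such a protocol might account for refusals — sketched here as an assumption rather than the researchers' actual design — is to report refusal rate as a separate metric instead of letting refused outputs drag down the quality score. The classifier argument could be a heuristic like the classify_output function above:

```python
from statistics import mean

def evaluate(outputs, references, score_fn, classify_fn):
    """Score only translatable outputs; report refusals separately
    rather than assigning them a quality score of zero."""
    scores, refusals = [], 0
    for out, ref in zip(outputs, references):
        if classify_fn(out) in ("refusal", "empty"):
            refusals += 1
        else:
            scores.append(score_fn(out, ref))
    return {
        "quality": mean(scores) if scores else None,
        "refusal_rate": refusals / len(outputs),
    }
```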
“We hope this paper raises awareness of the premises and pitfalls of evaluating LLM outputs and inspires future studies to address them directly,” the researchers concluded.
Authors: Eleftheria Briakou, Zhongtao Liu, Colin Cherry, Markus Freitag