Google Finds ‘Refusal to Translate’ Most Common Form of LLM Verbosity – slator.com

In an October 1, 2024 paper, researchers from Google identified a key challenge in evaluating large language models (LLMs) for machine translation (MT): verbosity.

The term verbosity refers to instances where LLMs explain the reasoning behind their translation choices, provide multiple translations, or even refuse to translate certain content.

The researchers explained that, unlike traditional MT systems — which are explicitly trained and optimized for producing a single translation for a given source text — LLMs tend to take a “more conversational approach.” This behavior challenges traditional evaluation frameworks, which are designed for more structured input-output models.

After analyzing several LLMs, the researchers found that verbosity is widespread, but its degree varies across models. OpenAI’s GPT-4 and Cohere’s Aya23 were the least verbose, whereas Google’s Gemini-1.5-Pro emerged as the most verbose LLM, often providing commentary or alternative translations. Mistral AI’s Mistral-Large and Anthropic’s Claude-3.5 exhibited moderate verbosity, while Meta’s LLaMA-3-70B, Cohere’s CommandR+, and Microsoft’s Phi-3-Medium showed low levels of verbosity.

The most common form of verbosity observed was LLMs refusing to translate certain content. For instance, Claude-3.5 frequently refused to translate, while Gemini-1.5-Pro and Mistral-Large exhibited a more balanced mix of refusal to translate and commentary, leaning slightly towards the latter.

Triggers for Verbosity

According to the researchers, LLMs typically refuse to translate when they encounter potentially harmful or copyrighted content, or when faced with non-natural language input like URLs or code snippets.

These triggers are prioritized differently across LLMs. Claude-3.5, for instance, is particularly sensitive to safety and copyright concerns, while Gemini-1.5-Pro and Mistral-Large primarily refuse to translate non-linguistic content. Additionally, some LLMs, such as Phi-3-Medium and Aya23, return empty outputs instead of verbose explanations when they refuse to translate.

Beyond refusals, LLMs can produce verbose outputs that contextualize their translation choices, providing alternative options or additional commentary. This behavior is particularly prominent in Gemini-1.5-Pro and Mistral-Large, though it is notably absent in GPT-4 and Aya23. The researchers pointed out that “short input segments lacking sufficient context are the primary reason for this verbose behavior.”
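To make the categories above concrete, here is a minimal sketch of how an evaluation pipeline might sort model outputs into empty outputs, refusals, commentary, and plain translations. It is not the researchers' detection method, which the article does not describe. The function name `classify_output` and both cue lists are hypothetical, English-only assumptions chosen for illustration.

```python
# Hypothetical cue phrases (assumptions for illustration only);
# a real detector would need a tuned, multilingual approach.
REFUSAL_CUES = (
    "i cannot translate",
    "i can't translate",
    "i am unable to translate",
    "i'm unable to translate",
)
COMMENTARY_CUES = (
    "note:",
    "explanation:",
    "alternatively",
    "another option",
    "could also be translated",
)


def classify_output(output: str) -> str:
    """Return 'empty', 'refusal', 'commentary', or 'plain' for one model output."""
    text = output.strip()
    if not text:
        # Some models return nothing at all instead of an explanation.
        return "empty"
    lowered = text.lower()
    if any(cue in lowered for cue in REFUSAL_CUES):
        return "refusal"
    if any(cue in lowered for cue in COMMENTARY_CUES):
        return "commentary"
    return "plain"
```

Keyword matching like this will miss many cases, such as refusals phrased in the target language, so it is best read as a starting point for flagging segments for closer inspection.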

Evaluation Challenges

One major concern raised by the researchers is that existing automatic and human evaluation frameworks do not account for verbose behaviors, often penalizing models that exhibit verbosity.

This can distort LLM performance rankings. In their analysis, models like Gemini-1.5-Pro and Claude-3.5 ranked lower when their verbose outputs were included but performed much better when verbosity was excluded from the evaluation.
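The ranking effect can be illustrated with a small sketch. The system names, segment scores, and verbosity flags below are invented toy values, not results from the paper, and `rank_systems` is a hypothetical helper rather than the researchers' evaluation code.

```python
from statistics import mean


def rank_systems(scores, verbose_flags, exclude_verbose=False):
    """Rank systems by mean segment score, best first.

    scores: {system: [score per segment]}
    verbose_flags: {system: [True if that segment's output was verbose]}
    """
    averages = {}
    for system, segment_scores in scores.items():
        flags = verbose_flags[system]
        kept = [
            score
            for score, is_verbose in zip(segment_scores, flags)
            if not (exclude_verbose and is_verbose)
        ]
        if kept:  # skip systems with no segments left to score
            averages[system] = mean(kept)
    return sorted(averages, key=averages.get, reverse=True)


# Toy data (invented): system_a is verbose on its third segment,
# and the metric scores that output very low.
scores = {"system_a": [0.9, 0.9, 0.1], "system_b": [0.7, 0.7, 0.7]}
flags = {"system_a": [False, False, True], "system_b": [False, False, False]}

with_verbose = rank_systems(scores, flags)
without_verbose = rank_systems(scores, flags, exclude_verbose=True)
```

With these toy values, `system_b` ranks first when every segment is scored, while `system_a` ranks first once its verbose segment is excluded. This sketch filters segments per system for simplicity; a careful comparison would score all systems on the same subset of segments.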

“This discrepancy highlights that current […] metrics do not adequately account for the nuanced outputs, leading to potentially misleading rankings,” the researchers noted.

Context-Aware Evaluation

There are two possible ways to address this issue: either modify LLM outputs to fit standardized evaluation metrics or update evaluation frameworks to better accommodate the varied responses of LLMs.

For example, verbosity could be minimized via prompts, or the output structure could be adjusted to separate commentary from core translations.
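As a rough sketch of both ideas, the snippet below builds a prompt that asks for the translation and any commentary in separate tags, then splits a response accordingly. The tag names `<translation>` and `<notes>`, the prompt wording, and the functions `build_prompt` and `split_response` are all assumptions made for illustration; the article does not specify the prompts or output format the researchers used.

```python
import re


def build_prompt(source_text: str, source_lang: str, target_lang: str) -> str:
    """Build a prompt that asks for a tagged, minimally verbose answer."""
    return (
        f"Translate the following text from {source_lang} to {target_lang}.\n"
        "Put only the translation inside <translation></translation> tags.\n"
        "If you have any comments, put them inside <notes></notes> tags.\n\n"
        f"Text: {source_text}"
    )


def split_response(response: str):
    """Return (translation, notes); translation is None if the tag is missing."""
    translation_match = re.search(
        r"<translation>(.*?)</translation>", response, re.DOTALL
    )
    notes_match = re.search(r"<notes>(.*?)</notes>", response, re.DOTALL)
    translation = translation_match.group(1).strip() if translation_match else None
    notes = notes_match.group(1).strip() if notes_match else ""
    return translation, notes
```

Only the extracted translation would then be passed to the metric. A `None` translation means the model did not follow the format, for example because it refused, so such segments still need separate handling.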

However, these methods do not entirely solve the problem. As the researchers pointed out, “they may not account for all verbosity-induced errors, especially refusal” and “they make no attempt to reward useful verbosity.”

The researchers argue that context-aware evaluations are necessary to accurately assess the quality of verbose outputs. Specifically, handling cases where LLMs refuse to translate poses the greatest challenge, and they recommend that future evaluation protocols and datasets account for these behaviors.

“We hope this paper raises awareness of the premises and pitfalls of evaluating LLM outputs and inspires future studies to address them directly,” the researchers concluded.

Authors: Eleftheria Briakou, Zhongtao Liu, Colin Cherry, Markus Freitag
