In a March 6, 2025 paper, researchers from China-based institutions the Shanghai AI Laboratory, Westlake University (Hangzhou), and Northeastern University (Shenyang) demonstrated that large language models (LLMs) still suffer from “translationese” — overly literal and unnatural translations that deviate from native linguistic norms.

They explained that while previous research has explored translationese in traditional machine translation (MT) systems, there has been limited work on whether this issue persists in LLMs.

“To our knowledge, this is the first systematic study addressing translationese in LLMs,” they said.

Given that LLMs are trained on vast corpora of native-language text, one might expect them to be less susceptible to translationese and more capable of producing natural translations. However, their study reveals the opposite: LLMs still produce "unexpected" unnatural translations, and translationese remains a "persistent challenge" for AI translation.

The researchers evaluated various LLMs, including GPT-4, GPT-3.5, ALMA-7B/13B, and Mistral-7B, in the English-Chinese and German-English language pairs. They found that all LLMs exhibit “significant translationese errors” in both language pairs. 

Specifically, more than 40% of GPT-4’s translations contained translationese errors, while Mistral-7B had the highest rate at 76% for the English-Chinese language pair. Additionally, larger models produced more natural translations than smaller ones.

“Polishing” Helps

The researchers first explored whether prompting strategies could reduce translationese. In addition to a standard translation prompt (Please translate the following {source_language} text to {target_language}), they tested two alternatives: a “specified” and a “polishing” prompt. 

The specified prompt adds explicit requirements intended to improve naturalness, while the polishing prompt instructs the model to refine its own translations in a two-step process: first generating a translation, then improving it.
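The paper does not reproduce its full prompt wording, but the three strategies can be sketched roughly as follows. The snippet below is a minimal illustration assuming an OpenAI-style chat API via the openai Python client; the model name and the "specified" and "polishing" prompt texts are illustrative stand-ins, not the authors' exact prompts.

```python
# Sketch of the three prompting strategies: standard, "specified"
# (naturalness requirements stated up front), and two-step "polishing".
# Assumes the openai Python client; wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"  # any chat-capable model

def chat(prompt: str) -> str:
    # Send a single user message and return the model's reply.
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def translate_standard(text: str, src: str, tgt: str) -> str:
    # Standard prompt, as quoted in the article.
    return chat(f"Please translate the following {src} text to {tgt}: {text}")

def translate_specified(text: str, src: str, tgt: str) -> str:
    # "Specified" prompt: naturalness requirements added (illustrative wording).
    return chat(
        f"Please translate the following {src} text to {tgt}. "
        f"Avoid literal, word-for-word renderings and produce natural, "
        f"idiomatic {tgt}: {text}"
    )

def translate_polish(text: str, src: str, tgt: str) -> str:
    # "Polishing" prompt: translate first, then ask the model to refine its own output.
    draft = translate_standard(text, src, tgt)
    return chat(
        f"Here is a {tgt} translation of a {src} sentence:\n{draft}\n"
        f"Please polish it so it reads like natural, native {tgt}, "
        f"while preserving the original meaning."
    )
```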

Interestingly, the researchers found that merely specifying naturalness requirements in prompts did not reliably reduce translationese — and in some cases, made translations worse. For example, under specified prompts, GPT-4 exhibited an increase in translationese errors.

Conversely, asking LLMs to refine their own translations proved more effective. In particular, GPT-4 reduced translationese from 43% to 25% when it was instructed to polish its outputs.

Supervised Fine-Tuning Reinforces Translationese

According to the researchers, the effectiveness of self-refinement suggests that LLMs are not inherently prone to translationese; rather, supervised fine-tuning (SFT) introduces biases by prioritizing faithfulness (i.e., literal semantic mapping) over fluency.


Specifically, over 34% of fine-tuning training data examined by the researchers exhibited translationese, reinforcing unnatural patterns in model outputs.

To mitigate translationese bias in SFT data, they proposed two strategies (sketched in code after the list):

  1. Using LLMs to “polish” gold reference translations before fine-tuning.
  2. Using LLMs to filter and remove unnatural translations from training data.
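As a rough illustration of these two data-side strategies, the sketch below reuses the chat helper from the earlier snippet. The polishing and judging prompts, the yes/no naturalness check, and the function names are assumptions made for illustration, not the authors' actual pipeline.

```python
# Sketch of the two SFT data-cleaning strategies: (1) polish gold references
# with an LLM before fine-tuning, or (2) filter out training pairs whose
# references the LLM judges to be unnatural. Prompts and the `chat` helper
# (from the previous sketch) are illustrative assumptions.

def polish_reference(source: str, reference: str, tgt: str) -> str:
    # Strategy 1: rewrite the gold reference so it reads naturally in the target language.
    return chat(
        f"Polish the following {tgt} translation so it reads like natural, "
        f"native {tgt}, keeping the meaning of the source.\n"
        f"Source: {source}\nTranslation: {reference}"
    )

def is_natural(reference: str, tgt: str) -> bool:
    # Strategy 2: use the LLM as a judge and keep only references it rates as natural.
    verdict = chat(
        f"Does the following {tgt} text read like natural, native {tgt} "
        f"rather than a literal translation? Answer yes or no.\n{reference}"
    )
    return verdict.strip().lower().startswith("yes")

def clean_sft_data(pairs: list[dict], tgt: str, polish: bool = True) -> list[dict]:
    # pairs: [{"source": ..., "reference": ...}, ...]
    cleaned = []
    for pair in pairs:
        ref = pair["reference"]
        if polish:
            ref = polish_reference(pair["source"], ref, tgt)  # Strategy 1
        elif not is_natural(ref, tgt):                        # Strategy 2
            continue
        cleaned.append({"source": pair["source"], "reference": ref})
    return cleaned
```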

Experiments with Llama-3.1-8B and Qwen-2.5-7B showed that these methods can improve translation quality. “These approaches significantly reduce translationese while improving translation naturalness,” the researchers said.

“These findings underscored the importance of addressing data quality and training methodologies in developing robust and natural translation systems,” they concluded.

Authors: Yafu Li, Ronghao Zhang, Zhilin Wang, Huajian Zhang, Leyang Cui, Yongjing Yin, Tong Xiao, and Yue Zhang


