In an April 8, 2025 paper, researchers from Huawei and Soochow University proposed a new approach to improving document-level AI translation using large language models (LLMs).
Previously, the researchers suggested improving document-level AI translation by first prompting the LLM with a summary of the document and translations of key entity terms (e.g., names, places, events). These additions help the model maintain consistency and better understand the text’s overall context.
Now, they take it a step further by giving the LLM two different versions of the same translation — one generated sentence by sentence (Sent2Sent) and another generated as a full document (Doc2Doc) — and asking the model to produce an improved version that draws on both.
The researchers explained that each version has strengths and weaknesses. Sentence-level translations tend to be more fluent and accurate at the sentence level but often lack consistency across the document. For example, the same term might be translated differently from one sentence to the next. Document-level translations, by contrast, are more consistent and context-aware — but may omit details or entire phrases.
To address this, they combine the two outputs and let the LLM refine them into a single, better translation. “We propose finetuning LLMs for translation refinement using two intermediate translations, combining the strengths of both Sent2Sent and Doc2Doc,” the researchers said.
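In practice, this means the model receives the source document alongside both intermediate translations and is asked to merge them. The sketch below is illustrative only; the prompt wording and the function name are assumptions, not the paper’s actual template.

```python
# Illustrative sketch (not the paper's exact prompt template): building a
# refinement request that gives the LLM both intermediate translations.

def build_refinement_prompt(source_doc: str, sent2sent: str, doc2doc: str,
                            src_lang: str = "English", tgt_lang: str = "German") -> str:
    """Combine the source document with its sentence-level (Sent2Sent) and
    document-level (Doc2Doc) translations so the model can draw on both."""
    return (
        f"Refine the {tgt_lang} translation of the following {src_lang} document.\n\n"
        f"Source document:\n{source_doc}\n\n"
        f"Translation A (sentence-by-sentence; fluent but possibly inconsistent):\n{sent2sent}\n\n"
        f"Translation B (whole-document; consistent but possibly missing details):\n{doc2doc}\n\n"
        "Produce a single improved translation that keeps terminology consistent "
        "and preserves all content from the source."
    )
```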
Better Quality
To test the effectiveness of their approach, the researchers fine-tuned two open-source models — LLaMA-3-8B-Instruct and Mistral-Nemo-Instruct.
To train the model, they used a dataset of source documents, the two intermediate translations (sentence-level and document-level), and a human reference translation. The LLM is trained to compare the two inputs and generate a better output.
To help the model focus on the parts that need the most improvement, the researchers introduced a quality-aware training method. Translations that are already close to the final version are given less importance during training, while more difficult or error-prone segments are given more weight — helping the model learn where improvements really matter.
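The core idea is to down-weight training signal from segments the intermediate translations already get right. The following is a minimal sketch of that weighting idea under stated assumptions: the paper’s exact formulation is not given in the article, and `SequenceMatcher` stands in for a real quality metric such as COMET.

```python
# Minimal sketch of quality-aware loss weighting (assumed formulation):
# segments whose intermediate translation is already close to the reference
# contribute less to the training loss.

from difflib import SequenceMatcher

def segment_weight(intermediate: str, reference: str,
                   min_weight: float = 0.2) -> float:
    """Return a loss weight in [min_weight, 1.0]; higher similarity -> lower weight.
    SequenceMatcher is a stand-in for a learned quality metric."""
    similarity = SequenceMatcher(None, intermediate, reference).ratio()
    return max(min_weight, 1.0 - similarity)

def weighted_refinement_loss(per_segment_losses: list[float],
                             intermediates: list[str],
                             references: list[str]) -> float:
    """Weight each segment's loss by how much refinement it still needs."""
    weights = [segment_weight(i, r) for i, r in zip(intermediates, references)]
    return sum(w * l for w, l in zip(weights, per_segment_losses)) / sum(weights)
```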
The method was tested on ten language directions, including English > German, French, Chinese, and Russian. Across all tasks, the dual-translation refinement approach outperformed models trained to refine only one version of the translation, according to the researchers.
“Our refinement approach, based on the two intermediate translations […], significantly improves translation performance across all language pairs,” they said.
For example, using this method, LLaMA-3-8B-Instruct gained up to +2.7 COMET points, and Mistral-Nemo-Instruct showed similar improvements. This suggests that even relatively small LLMs, such as the 8B-parameter LLaMA-3, can effectively refine translations when properly fine-tuned.
Moreover, the refined models also improved translations from other systems — including GPT-4o-mini and NLLB — showing that this approach can serve as a post-processing layer even for strong AI translation outputs.
The code is available on GitHub.
Authors: Yichen Dong, Xinglin Lyu, Junhui Li, Daimeng Wei, Min Zhang, Shimin Tao, and Hao Yang