In a May 20, 2025 paper, researchers from Alibaba and Beijing Language and Culture University introduced TransBench, a benchmark designed to evaluate AI translation systems for real-world industrial applications — starting with international e-commerce.
The researchers argue that existing benchmarks and automatic metrics fall short when assessing performance in specialized domains, where accurate terminology, domain-appropriate style, and cultural nuance are essential, largely because those benchmarks lack realistic, domain- and culturally representative data.
As a result, there is a “significant evaluation gap” between observed performance on standard benchmarks and real-world effectiveness, the researchers noted, making it difficult for researchers and practitioners to accurately assess and improve AI translation for industry-specific use cases.
TransBench aims to close this gap with a multi-level framework and datasets that reflect actual industrial use cases. While the initial release focuses on e-commerce, the team plans to extend coverage to other high-impact sectors, including finance and legal translation.
Holistic Evaluation
At the heart of TransBench is a framework that evaluates translation quality across three interconnected dimensions: basic linguistic competence, domain-specific proficiency, and cultural adaptation, reflecting the researchers’ emphasis on “holistic evaluation.”
Basic linguistic competence focuses on grammatical correctness, fluency, and basic vocabulary mapping. Domain-specific proficiency evaluates whether models correctly apply terminology, style, and context within specialized domains. Cultural adaptation measures how well systems reflect local norms — including tone, politeness, and the appropriate handling of honorifics and culturally sensitive content.
“This framework posits that effective industrial translation goes beyond mere linguistic transfer, necessitating proficiency across distinct yet interconnected levels,” the researchers said.
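The paper describes this framework conceptually rather than as an implementation, but the three-level structure can be pictured as a per-translation score record with one value per dimension. In the minimal sketch below, the class name, field names, 0-to-1 scale, and equal-weight aggregation are illustrative assumptions, not details taken from TransBench.

```python
from dataclasses import dataclass


@dataclass
class TransBenchScore:
    """Illustrative record of per-dimension translation scores.

    Field names and the 0-1 scale are assumptions for illustration;
    the paper defines its own indicators and scoring scheme.
    """
    linguistic: float  # basic linguistic competence (grammar, fluency, vocabulary)
    domain: float      # domain-specific proficiency (terminology, style, context)
    cultural: float    # cultural adaptation (tone, politeness, honorifics, taboos)

    def overall(self) -> float:
        # Equal weighting is a placeholder; the actual benchmark defines
        # its own indicators and aggregation.
        return (self.linguistic + self.domain + self.cultural) / 3


# Hypothetical example: strong grammar, weaker terminology and cultural fit.
score = TransBenchScore(linguistic=0.92, domain=0.74, cultural=0.61)
print(f"overall: {score.overall():.2f}")
```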
In line with that goal, TransBench introduces evaluation indicators that go beyond traditional metrics. These include hallucination rate, which measures how often models fabricate content not present in the source; taboo term detection, which evaluates whether outputs avoid culturally inappropriate language; and honorific norms, which assess whether the formality level is suitable for the target audience.
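To make these indicators concrete, the sketch below shows how naive versions of two of them, taboo term detection and a rough hallucination proxy, might be computed. The word lists, the token-overlap heuristic, and the function names are assumptions for illustration only; TransBench specifies its own indicators and scoring procedures.

```python
import re

# Illustrative, hand-picked list; a real benchmark would use curated,
# locale-specific resources for culturally inappropriate language.
TABOO_TERMS = {"cheap knockoff", "third-world"}


def taboo_term_hits(translation: str) -> list[str]:
    """Return taboo terms found in the output (case-insensitive substring match)."""
    lowered = translation.lower()
    return [term for term in TABOO_TERMS if term in lowered]


def hallucination_proxy(source: str, translation: str) -> float:
    """Crude proxy: share of 'content-like' output tokens (digits, longer words)
    with no counterpart in the source. Real hallucination metrics rely on
    alignment or human/LLM judgment, not this heuristic."""
    src_tokens = set(re.findall(r"\w+", source.lower()))
    out_tokens = [t for t in re.findall(r"\w+", translation.lower())
                  if t.isdigit() or len(t) > 6]
    if not out_tokens:
        return 0.0
    unsupported = [t for t in out_tokens if t not in src_tokens]
    return len(unsupported) / len(out_tokens)


print(taboo_term_hits("High-quality case, not a cheap knockoff."))
print(hallucination_proxy("Waterproof phone case, 2-year warranty",
                          "Waterproof case with 5-year warranty and free charger"))
```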
Comparative Model Assessment
In addition to offering a structured evaluation approach, TransBench also supports comparative model assessment. As of May 2025, GPT-4o ranks first overall, followed closely by DeepL Translate and GPT-4-Turbo. DeepSeek-R1 performs particularly well in e-commerce. Qwen series models lead in cultural adaptation, while Claude 3.5 Sonnet and DeepSeek-V3 stand out in Chinese translation tasks.
To support transparency and industry-wide adoption, the researchers have open-sourced the benchmark’s construction guidelines and datasets. They also encourage contributions and participation from across the industry to support horizontal comparisons and help establish more robust standards.
Authors: Haijun Li, Tianqi Shi, Zifu Shang, Yuxuan Han, Xueyu Zhao, Hao Wang, Yu Qian, Zhiqiang Qian, Linlong Xu, Minghao Wu, Chenyang Lyu, Longyue Wang, Gongbo Tang, Weihua Luo, Zhao Xu, and Kaifu Zhang