AI translation has become an integral part of translation workflows, with post-editing being a standard practice in the language industry. 

In this context, quality estimation (QE) has been proposed and integrated into many computer-assisted translation (CAT) tools to direct post-editors’ attention to segments that need revision.

While segment-level QE enables translators to focus on problematic segments, word-level QE provides a more granular approach. By highlighting specific words or phrases within a segment that might contain errors, word-level QE offers a promising way to pinpoint issues within a sentence — ideally reducing cognitive load and improving efficiency.
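
For readers less familiar with the format, word-level QE output is typically a per-token OK/BAD tagging of the machine translation, and the runs of BAD tokens are what an editing interface would render as highlights. The sketch below is illustrative only — the tokens, tags, and helper names are hypothetical and not taken from the study or its GROTE tool.

```python
# Illustrative sketch (not the study's code): word-level QE is commonly framed as
# tagging each token of the machine translation as OK or BAD; consecutive BAD
# tokens form the spans an editor would see highlighted.

from dataclasses import dataclass

@dataclass
class Highlight:
    start: int        # index of first flagged token
    end: int          # index after last flagged token (exclusive)
    tokens: list      # the flagged tokens themselves

def spans_from_tags(tokens, tags):
    """Group consecutive BAD tokens into highlight spans."""
    spans, i = [], 0
    while i < len(tokens):
        if tags[i] == "BAD":
            j = i
            while j < len(tokens) and tags[j] == "BAD":
                j += 1
            spans.append(Highlight(i, j, tokens[i:j]))
            i = j
        else:
            i += 1
    return spans

# Hypothetical example: a QE model flags two tokens in an English>Italian output.
mt_tokens = ["Il", "gatto", "sedeva", "sul", "tappeto", "rosso"]
qe_tags   = ["OK", "OK",    "BAD",    "OK",  "OK",      "BAD"]
for span in spans_from_tags(mt_tokens, qe_tags):
    print(span)
```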

However, researchers from the University of Groningen, ETH Zürich, and Tilburg University noted in a March 4, 2025 paper that there has been limited research on how word-level QE affects human post-editing in real-world scenarios. 

“While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied,” they said.

To address this gap, they investigated the real-world impact of word-level QE on translation quality, productivity, and usability. They tested four QE highlight modalities — no highlights, unsupervised, supervised, and oracle highlights — on 42 professional translators working on English>Italian and English>Dutch translation tasks. 

They measured post-editing effort, productivity, quality, and usability through behavioral logs, questionnaires, and human and automatic evaluations. The experiments were conducted using GROTE, an online editing tool designed for this study that supports real-time data logging.

Quality Gains but No Speed Boost

The researchers found that QE highlights — even when imperfect — can effectively direct post-editing efforts toward problematic areas. Translators corrected 16–20% more critical errors when working with highlighted texts compared to the no-highlight baseline. 

“The presence of highlights might result in narrow but tangible quality improvements that remain undetected in coarser assessments,” they said.

Contrary to expectations, the presence of highlights did not consistently speed up the editing process. In some cases, lower-quality QE highlights slowed down post-editors by increasing cognitive load. 

The proportion of highlighted text was a key driver of longer editing times: the more highlights a segment contained, the harder it was for post-editors to process, likely because of the added cognitive load of weighing the extra information. Translation direction and domain also played a role.

More of a Distraction Than a Help?

Despite the quality benefits, post-editors found the highlights distracting rather than helpful. Many participants reported that the highlights were “more of an eye distraction,” “not quite accurate enough to rely on,” or “not actual mistakes.” Some even ignored the highlights entirely, preferring a manual review process instead.

“These comments convincingly point to a negative perception of the quality and usefulness of highlights, suggesting that improvement in QE accuracy may not be sufficient to improve QE usefulness in editors’ eyes,” the researchers said.

“It’s better to let the professional human translators do their work, without distractions or biases.” — Adam Bittlingmayer, CEO, ModelFront

Adam Bittlingmayer, CEO of ModelFront, echoed this concern. Talking to Slator, he explained that ModelFront’s customers use AI to fully automate large-scale translations by automatically verifying as many segments as safely possible, so that they can skip manual human post-editing. “The segments that do still go to manual human post-editing are never going to be super efficient, but they’re key to monitoring and training AI systems,” he said.

In that sense, Bittlingmayer argues, “it’s better to let the professional human translators do their work, without distractions or biases.”


Unbabel’s Take

With Unbabel’s Widn.ai being, for now, the only commercial solution that returns error spans, Ricardo Rei, Senior Research Scientist at Unbabel, also spoke to Slator and provided additional insights.

Since the launch of Widn.ai, Unbabel’s community of post-editors has been using its “quality highlights” and reporting positive results. “Word-level QE highlights improved overall quality of the delivered translations,” Rei said, echoing the study’s finding that word-level QE can improve quality by effectively directing post-editing efforts.

Although Unbabel has not yet collected data on edit time, Rei acknowledged that the impact of word-level QE on post-editing speed depends on both content type and user interface (UI) design. Marketing content, for example, can trigger false positives because of non-literal translations, distracting and slowing post-editors, whereas QE performs well on customer support and news content.

“The way errors are highlighted is crucial,” Rei said, confirming that “if the highlights are too intrusive, they may become a distraction.” 

“[Word-level QE is a] valuable tool beyond just assisting human translators.” — Ricardo Rei, Senior Research Scientist, Unbabel

To mitigate this, for more challenging content types Unbabel only displays QE results when the model has high confidence. This means they filter out lower-confidence predictions, reducing the number of flagged errors overall but ensuring that the highlighted issues are more likely to be correct.
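
A minimal sketch of that confidence-filtering idea is shown below. It is illustrative only and not Unbabel’s implementation: the threshold values, content-type labels, and field names are assumptions for the example.

```python
# Illustrative sketch of confidence-based filtering of word-level QE highlights:
# keep a highlight only when the model's error probability clears a threshold,
# and raise the threshold for content types prone to false positives
# (e.g. marketing copy with non-literal translations).

def filter_highlights(highlights, content_type,
                      default_threshold=0.5, strict_threshold=0.8):
    """highlights: list of dicts like {"span": (start, end), "error_prob": float}."""
    threshold = strict_threshold if content_type == "marketing" else default_threshold
    return [h for h in highlights if h["error_prob"] >= threshold]

# Hypothetical predictions from a QE model:
predictions = [
    {"span": (2, 3), "error_prob": 0.92},
    {"span": (5, 6), "error_prob": 0.55},
]
print(filter_highlights(predictions, "marketing"))  # only the high-confidence span survives
print(filter_highlights(predictions, "support"))    # both spans are shown
```

The trade-off is the one the article describes: fewer flagged errors overall, but the flags that remain are more likely to be correct.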

Beyond post-editing, Rei pointed to other potential applications of word-level QE, including AI post-editing and reinforcement learning, calling word-level QE a “valuable tool beyond just assisting human translators.”

UX Matters

While QE remains an important component of post-editing workflows, the findings of this study suggest that usability improvements are just as important as technical accuracy.

The researchers noted that their work aligns with “recent calls for an evaluation of translation technologies that is centered on users’ experience,” suggesting that future developments should focus on improving the usability of these methods in editing interfaces to ensure they genuinely assist rather than disrupt the editing process.

To promote further research, they have released their data, code, and the GROTE editing interface, enabling others to replicate and build on their findings.

Authors: Gabriele Sarti, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof-Arenas, Malvina Nissim, and Arianna Bisazza


