Two recent papers explore how question answering (QA) can serve as a tool for automatic AI translation evaluation, pointing to a shift in how the language industry might approach translation quality assessment.
The first paper, from the University of Maryland and Johns Hopkins University, introduces ASKQE, a framework that generates factual questions from the source text and checks whether those questions are answered consistently after translation and back-translation.
The idea is that if key questions about the source yield different answers depending on whether they are answered from the source or from the back-translation, the translation is likely unreliable.
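In outline, the check can be illustrated with a short Python sketch, assuming hypothetical `ask_llm` and `back_translate` helpers supplied by the caller; this is an illustration of the general idea, not the researchers' released code:

```python
from typing import Callable

def askqe_style_score(
    source: str,
    translation: str,
    ask_llm: Callable[[str], str],         # prompt -> LLM completion (hypothetical helper)
    back_translate: Callable[[str], str],  # target-language text -> source language (hypothetical helper)
    n_questions: int = 5,
) -> float:
    """Fraction of factual questions whose answers match between the source
    and the back-translation of the MT output (higher = more reliable)."""
    # 1. Generate factual questions from the source sentence.
    questions = [
        q.strip()
        for q in ask_llm(
            f"Write {n_questions} short factual questions, one per line, "
            f"answerable from this sentence:\n{source}"
        ).splitlines()
        if q.strip()
    ]
    if not questions:
        return 0.0

    # 2. Answer each question from the source and from the back-translation.
    back = back_translate(translation)

    def answer(question: str, context: str) -> str:
        return ask_llm(
            f"Context: {context}\nQuestion: {question}\nAnswer briefly:"
        ).strip().lower()

    # 3. Score is the share of questions with matching answers.
    matches = sum(answer(q, source) == answer(q, back) for q in questions)
    return matches / len(questions)
```

A real system would likely compare answers more robustly than the exact string match used here, but the core loop stays the same: generate questions from the source, answer them from both texts, and compare.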
According to the researchers, this approach is designed to help monolingual speakers — who lack knowledge of the target language — determine whether an AI translation is good enough by identifying critical translation errors and providing actionable feedback.
Accurately assessing AI translation quality is “significantly more challenging for monolingual speakers than for bilinguals, as they rely on translations in a language they do not understand,” they noted.
Potential in High-Stakes Contexts
The researchers found that ASKQE can effectively distinguish minor from critical errors and aligns well with standard quality estimation (QE) metrics. Specifically, ASKQE achieved the highest decision accuracy because it is less focused on surface issues (like spelling) and more effective at identifying mistranslations or stylistic flaws that affect meaning.
It also delivered more actionable, interpretable feedback, which the researchers say highlights “ASKQE’s potential not only for MT quality assessment but also as actionable feedback to support real-world decision making in high-stakes contexts.”
“We demonstrate that using ASKQE feedback achieves higher decision accuracy than other QE metrics,” they noted.
The researchers have released their code and dataset to promote further research.
A More Pragmatic Approach
The second paper, by researchers at Carnegie Mellon University, Instituto de Telecomunicações, Instituto Superior Técnico, and Unbabel, extends the QA-based approach beyond the sentence level to entire paragraphs.
Their proposed method, TREQA (Translation Evaluation via Question Answering), generates comprehension questions over entire paragraphs and tests whether they can still be answered using the AI-translated text.
Unlike most existing metrics, which operate at the sentence level, TREQA captures whether meaning is preserved at the discourse level, which is crucial for evaluating legal, scientific, or literary texts where context and coherence matter.
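The paragraph-level check can be sketched along similar lines, again with a hypothetical `ask_llm` helper and a crude exact-match answer comparison standing in for the paper's actual pipeline:

```python
from typing import Callable, Optional

def treqa_style_score(
    source_paragraph: str,
    candidate_translation: str,
    ask_llm: Callable[[str], str],              # prompt -> LLM completion (hypothetical helper)
    reference_paragraph: Optional[str] = None,  # supply for a reference-based setup
    n_questions: int = 8,
) -> float:
    """Share of paragraph-level comprehension questions that the candidate
    translation answers the same way as the source (or reference)."""
    anchor = reference_paragraph or source_paragraph

    # Generate comprehension questions over the whole paragraph, so that
    # discourse-level content (not just individual sentences) is probed.
    questions = [
        q.strip()
        for q in ask_llm(
            f"Write {n_questions} comprehension questions, one per line, "
            f"covering the key information in this paragraph:\n{anchor}"
        ).splitlines()
        if q.strip()
    ]
    if not questions:
        return 0.0

    def answer(question: str, context: str) -> str:
        # Answer in the question's language so answers from the (possibly
        # different-language) candidate and anchor remain comparable.
        return ask_llm(
            f"Context: {context}\nQuestion: {question}\n"
            "Answer briefly, in the language of the question:"
        ).strip().lower()

    # A question counts as preserved if the candidate translation yields the
    # same brief answer as the anchor paragraph.
    preserved = sum(
        answer(q, candidate_translation) == answer(q, anchor) for q in questions
    )
    return preserved / len(questions)
```

Passing a reference paragraph switches the sketch to a reference-based setup; leaving it out approximates the reference-free case, with questions anchored to the source paragraph itself.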
The researchers argue that it’s time to revisit “pragmatic,” task-centered evaluation, which looks not just at how a translation reads, but also at how effectively it enables readers to use the information it conveys.
“Evaluating translations beyond individual sentences requires taking a more pragmatic approach that considers how well a translation serves its intended purpose — i.e., how well it enables readers to use the information as effectively as readers of the source text,” they noted.
More Informative and Practically Useful
Earlier attempts at such extrinsic evaluation, in which translation quality is tested through reading comprehension questions rather than a single predicted score, were abandoned because generating questions and checking answers by hand was too labor-intensive. The authors note, however, that large language models (LLMs) can now automate much of this process.
Their experiments show TREQA matching or even outperforming state-of-the-art neural and LLM-based metrics in both reference-based and reference-free setups, while remaining fully unsupervised: no training on human quality assessments is needed.
“Our framework paves the way for future research towards more informative and practically useful translation evaluation methodologies,” the researchers concluded.