As large language models (LLMs) gain prominence as state-of-the-art evaluators, prompt-based evaluation methods like GEMBA-MQM have emerged as powerful tools for assessing translation quality.

However, LLM-based evaluation is expensive and computationally demanding, consuming large numbers of tokens and incurring significant API costs. Scaling evaluation to large datasets quickly becomes impractical, raising a key question: Can the industry reduce evaluation costs without sacrificing quality?

In a March 4, 2025 paper, researchers Daniil Larionov and Steffen Eger from the Natural Language Learning & Generation (NLLG) Lab at the University of Technology Nuremberg explained that traditional LLM-based evaluation usually operates on a single-example prompting basis, where each translation is assessed independently.

“This approach, while effective, can be inefficient in terms of token usage and computational resources, especially when scaling to large datasets of multiple language pairs,” they noted.

To address this inefficiency, the researchers proposed a combination of batched prompting and prompt compression — the latter being a method that removes redundant tokens while preserving essential information.

To test the effectiveness of a “batching-aware prompt compression strategy,” they extended the conventional GEMBA-MQM prompt — originally designed for single-example, few-shot evaluation — into a batched format, called BatchGEMBA-MQM. This new format enables the evaluation of multiple translation examples within a single prompt. The researchers evaluated its performance both with and without the application of prompt compression.
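To make the idea concrete, the sketch below shows how several source–translation pairs might be packed into a single MQM-style evaluation prompt. It is a minimal illustration only: the actual BatchGEMBA-MQM template, instructions, and few-shot demonstrations are defined in the paper and its released code, and the function and field names here are hypothetical.

```python
# Minimal sketch of batched MQM-style prompt construction (illustrative only;
# the real BatchGEMBA-MQM template lives in the paper and its released code).
from textwrap import dedent

def build_batched_mqm_prompt(examples, source_lang="English", target_lang="German"):
    """Pack several (source, translation) pairs into one evaluation prompt.

    `examples` is a list of dicts with "source" and "translation" keys.
    """
    header = dedent(f"""\
        You are an expert translation quality annotator.
        For each numbered {source_lang}->{target_lang} example below, list the MQM errors
        (category, severity, error span), or write "no-error". Answer for every example as:
        Example <n>: <errors or no-error>
        """)
    body = "\n".join(
        f"Example {i}:\nSource: {ex['source']}\nTranslation: {ex['translation']}\n"
        for i, ex in enumerate(examples, start=1)
    )
    return header + "\n" + body

# One prompt now covers a whole batch instead of one API call per segment.
batch = [
    {"source": "The weather is nice today.", "translation": "Das Wetter ist heute schön."},
    {"source": "Please close the door.", "translation": "Bitte schließen Sie die Tür."},
]
print(build_batched_mqm_prompt(batch))
```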

Their experiments involved multiple batch sizes (1, 2, 4, 8, and 16) and a range of LLMs, including OpenAI’s GPT-4o and GPT-4o-mini, Mistral AI’s Mistral Small, Microsoft’s Phi4, and Cohere’s CommandR7B.

Their findings suggest that it is possible to reduce computational overhead while maintaining evaluation accuracy.

Batching Reduces Evaluation Quality 

One of the key questions the researchers sought to answer was whether evaluation quality would suffer as batch size increased.

They found that batching generally reduces correlation with human judgments, meaning that evaluation quality tends to decline as batch size increases. However, the extent of this decline varies by model.

Among the models tested, GPT-4o-mini proved to be the most resilient. GPT-4o performed well at moderate batch sizes but experienced a sharp decline at higher batch sizes, while smaller models like Mistral Small degraded more quickly, suggesting that certain architectures struggle to process multiple translations within a single prompt.

The researchers noted that “the architecture and training of each LLM play a significant role,” and that “batching can stress the evaluation capabilities of certain systems.”

Compression Mitigates Quality Loss

According to the researchers, prompt compression could help mitigate the loss of evaluation quality in batched settings.

In the case of GPT-4o at batch size 4, compression preserved over 90% of its baseline accuracy, compared to a 44.6% drop without it. “While batching generally negatively affects quality (but sometimes not substantially), prompt compression does not degrade further, and in some cases, recovers quality loss,” they noted.


Misformatted Output

Beyond accuracy, the researchers also examined how batching and compression affect an LLM’s ability to deliver usable evaluations, measuring cases where misformatted output led to failed responses.
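As a rough illustration of that measurement, the sketch below counts batch items whose verdict cannot be recovered from a model’s reply. The expected “Example <n>: …” line format and the helper name are assumptions made here for illustration; the researchers’ actual parsing and failure criteria may differ.

```python
import re

def count_format_failures(response_text, batch_size):
    """Count batch items whose verdict cannot be parsed from the model output.

    Assumes a (hypothetical) expected format of one "Example <n>: ..." line per
    item; any item that is missing or unparseable counts as a failed response.
    """
    found = set()
    for match in re.finditer(r"Example\s+(\d+)\s*:", response_text):
        idx = int(match.group(1))
        if 1 <= idx <= batch_size:
            found.add(idx)
    return batch_size - len(found)

# A batch of 4 where the model only answered three items in the expected format:
reply = "Example 1: no-error\nExample 2: minor/fluency ...\nExample 4: no-error"
print(count_format_failures(reply, batch_size=4))  # -> 1 misformatted/missing item
```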

They found that GPT-4o-mini and GPT-4o generally show high format consistency even when processing multiple examples at once. GPT-4o maintained near-perfect performance in non-compressed settings and only showed difficulties at batch size 8 when compression was applied.

Mistral Small and Phi4 exhibited modest error rates that varied with batch size. CommandR7B showed an unusual pattern: it struggled with single-example formats but performed surprisingly well with batched inputs.

Resource-Efficient and Scalable LLM-based Evaluation

Regarding costs, Larionov and Eger found that the combined use of batching and compression led to a massive reduction in token consumption. 

Batching alone cut token usage, while compression provided an additional 40–60% reduction. At batch size 16 with compression, token usage dropped by 95%.
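The mechanism behind these savings is straightforward: batching amortizes the fixed instruction and few-shot overhead across many segments, and compression then shrinks what remains. The back-of-the-envelope sketch below illustrates this with purely hypothetical token counts; only the general ballpark, not the exact figures, matches the reported results.

```python
# Back-of-the-envelope illustration (hypothetical token counts, not the paper's
# measurements): batching spreads the fixed prompt overhead over many segments,
# and compression then removes a further share of the remaining tokens.

INSTRUCTION_TOKENS = 900   # assumed fixed overhead per prompt (template + few-shot demos)
TOKENS_PER_SEGMENT = 120   # assumed source + translation tokens per example
COMPRESSION_RATE = 0.5     # assumed ~50% of prompt tokens removed by compression

def tokens_needed(n_segments, batch_size, compress=False):
    n_prompts = -(-n_segments // batch_size)  # ceiling division
    per_prompt = INSTRUCTION_TOKENS + TOKENS_PER_SEGMENT * batch_size
    if compress:
        per_prompt *= (1 - COMPRESSION_RATE)
    return int(n_prompts * per_prompt)

baseline = tokens_needed(60_000, batch_size=1)                    # single-example prompting
batched = tokens_needed(60_000, batch_size=16)                    # batching alone
batched_c = tokens_needed(60_000, batch_size=16, compress=True)   # batching + compression

for label, t in [("baseline", baseline), ("batch=16", batched), ("batch=16 + compression", batched_c)]:
    print(f"{label:>24}: {t:>12,} tokens  ({1 - t / baseline:.0%} saved)")
```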

Considering that running GEMBA-MQM on 60,000 translations could cost nearly $1,000 in API fees, according to the researchers, these efficiency gains make high-quality LLM-based evaluation far more feasible for large-scale deployments.

“Our study provides a pathway towards more resource-efficient and scalable LLM-based evaluation for machine translation,” they concluded.

The researchers have open-sourced their work to support further exploration into batched prompting and prompt compression.


