In a paper for the 2024 conference of the European Association for Machine Translation (EAMT), interpreting technology researchers Xiaoman Wang (University of Leeds) and Claudio Fantinuoli (University of Mainz and CTO at Kudo) find that prompting OpenAI’s GPT-3.5 to assess the quality of translated speech approximates human evaluations.

The research explores the correlation between automated metrics and expert evaluations of both human simultaneous interpreting and AI speech translation, and suggests that the large language model (LLM) aligns well with human scores across various evaluation methods.

Measuring interpreting quality with the assistance of artificial intelligence (AI) could prove helpful to professional interpreters, interpreter trainers and students, and machine speech translation developers as a tool for improving performance.

Assessing the quality of simultaneous interpreting is a complex task because of the layered nuances of real-time multilingual communication, and because interpreters often adopt non-linear strategies, such as rephrasing, adaptation, and expansion, to deliver messages in contextually appropriate ways.

Still, interpreting quality assessment can provide valuable insights for practitioners, educators, and trainees, as well as scholars, certification bodies, and even clients and end users.

However, like any human evaluation, it is also time-consuming and resource-intensive and thus carried out only in limited scenarios.

The two researchers conducted a preliminary study to investigate the reliability of automated metrics for evaluating simultaneous interpreting. They measured the connection between automatic assessment results and expert-curated evaluations. These evaluations focused on a single feature: the accuracy of meaning transferred from one language to another.

Interpretations provided by three expert interpreters and a speech translation engine (Kudo’s AI Speech Translation, whose development is led by Fantinuoli himself) were put to the test.

AI Rivals Human Evaluations

The researchers first asked a team of 18 professional interpreters and bilingual individuals to manually rate the transcriptions of human and machine interpretations from English into Spanish of 12 real-life speeches.

The human assessment was taken as a benchmark for comparison with the automated evaluation metrics. The raters focused only on the faithfulness of the information rendered across the two languages; all other features of interpreting and oral speech were left out of consideration.

Although the evaluators did not know whether the transcription they were rating had been produced by a human or a machine, inter-rater agreement fluctuated widely and was generally low.

This attests to the complexity of gauging interpretation activities, as well as the degree of subjectivity entailed in determining what constitutes a good rendition of a speech.

The researchers then considered various automatic metrics to calculate the semantic similarity (i.e., the correspondence of concepts) between the transcribed source and target speeches.
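By way of illustration, one common family of such metrics compares multilingual sentence embeddings using cosine similarity. The minimal sketch below uses the sentence-transformers library with an assumed model name; the specific metrics evaluated in the paper may differ.

```python
# Illustrative only: cosine similarity of multilingual sentence embeddings,
# one common way to approximate semantic similarity across languages.
# The model name is an assumption, not necessarily what the study used.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

source_en = "The committee approved the budget for next year."
target_es = "El comité aprobó el presupuesto para el próximo año."

# Encode both segments into a shared multilingual embedding space.
embeddings = model.encode([source_en, target_es], convert_to_tensor=True)

# Cosine similarity close to 1.0 suggests the two segments convey similar meaning.
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.3f}")
```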

Among these methods, Wang and Fantinuoli looked at how an LLM, namely OpenAI’s GPT-3.5, performs in such a task.

They found that when prompted directly (“Given the two sentences in English and Spanish, rate from 1 to 5 their similarity, where 1 is not similar and 5 very similar”), the model shows a high correlation with human judgments, benefiting from its large context window.
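For readers curious about what such prompting looks like in practice, here is a minimal sketch using the OpenAI Python client with the prompt quoted above; the model name, message wrapping, and score parsing are illustrative assumptions rather than the authors’ exact setup.

```python
# Minimal sketch of a prompt-based similarity rating with the OpenAI Python client.
# The prompt mirrors the one quoted above; model choice and parsing are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rate_similarity(source_en: str, target_es: str) -> int:
    """Ask the model for a 1-5 similarity score between an English source and its Spanish rendition."""
    prompt = (
        "Given the two sentences in English and Spanish, rate from 1 to 5 "
        "their similarity, where 1 is not similar and 5 very similar.\n"
        f"English: {source_en}\n"
        f"Spanish: {target_es}\n"
        "Answer with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep scoring as deterministic as possible
    )
    # Assumes the model replies with a single digit, as instructed.
    return int(response.choices[0].message.content.strip())


print(rate_similarity("The meeting starts at noon.", "La reunión empieza al mediodía."))
```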

The research dataset is available on GitHub, and the transcriptions of the renditions with the assigned scores can also be accessed and consulted in easy-to-read spreadsheet and PDF formats.

Applications for Interpreters and LSPs

The integration of AI-enabled quality evaluation can offer the field of interpreting new resources and perspectives.

Interpreters may draw on AI feedback about various aspects of their renditions for continuing professional development. If such insights were provided in real time, interpreters could even make on-the-fly adjustments and enhance their overall performance.

Likewise, interpreter trainers and students could use automated quality evaluation as an additional resource to further elaborate on interpreting processes in the classroom.

Designers of speech translation systems may also find applications of automatic evaluations to streamline the assessment of their own technology, thus accelerating development cycles.

Nevertheless, at this stage, the mechanism offers only approximate and constrained estimates. Hence, all review and assessment processes should be performed under expert guidance to compensate for the shortcomings of a fully automated approach.

Despite advances in context-aware language models, these systems still lack a holistic understanding of quality as it is situated in a specific setting and social context.

Therefore, stakeholders contracting or evaluating interpreting services (LSPs, institutions, organizations, accreditation entities) cannot consider it a standalone solution for consistently and objectively measuring, examining, or monitoring interpreting quality.

The study’s notable limitations in scope, range, language coverage, and domains leave little room for generalizing its findings.

Additionally, an analysis of interpreting through text and transcriptions only cannot account for the entire set of oral features characterizing spoken language. The perception of end users in terms of information retention, intelligibility, and ultimately communicative effectiveness is not considered either.

As the authors themselves note at the end of the paper, “before these metrics can be used in production, more research needs to be conducted.” 


