In a September 5, 2024 paper, Alpha CRC and DCU researchers demonstrated that fine-tuning large language models (LLMs) with translation memories (TMs) can improve translation quality, reduce turnaround times, and offer cost savings.

They noted that out-of-the-box LLMs often fail to capture the nuances, tone, and specialized terminology necessary for domain- or organization-specific translations. “This is where TMs offer a potential solution,” they said.

The researchers explained that “by leveraging TMs, the model becomes more adept at recognizing and reproducing previously translated segments, their style, and terminology.” This fine-tuning process allows LLMs to adapt to specific domains and effectively translate content in highly specialized fields.

They emphasized that by leveraging previously human-translated material, language service providers (LSPs) can create custom translation models tailored to their needs. This not only improves translation quality but also helps LSPs obtain “the best possible return on investment” from their translation data.

In a “real-life scenario,” they fine-tuned Llama 3 8B Instruct using TMs from a software company and ran experiments across five translation directions: English > Brazilian Portuguese, Czech, German, Finnish, and Korean.
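
The paper's exact prompt template is not reproduced in the article, but a minimal sketch of how TM segment pairs can be converted into instruction-style fine-tuning records for a chat model like Llama 3 8B Instruct might look as follows. The template, field names, and file name are illustrative assumptions, not the researchers' actual format:

```python
# Minimal sketch: converting TM segment pairs into instruction-style
# fine-tuning records for a chat model such as Llama 3 8B Instruct.
# The prompt wording and JSONL schema are illustrative assumptions.
import json

def tm_to_training_record(source: str, target: str, tgt_lang: str) -> dict:
    """Wrap one TM segment pair as a chat-style training example."""
    return {
        "messages": [
            {"role": "system",
             "content": f"You are a professional translator. "
                        f"Translate the user's text into {tgt_lang}."},
            {"role": "user", "content": source},
            {"role": "assistant", "content": target},
        ]
    }

# Example: write a JSONL training file from (source, target) TM pairs.
tm_pairs = [("Click Save to apply your changes.",
             "Clique em Salvar para aplicar as alterações.")]
with open("train_pt_br.jsonl", "w", encoding="utf-8") as f:
    for src, tgt in tm_pairs:
        record = tm_to_training_record(src, tgt, "Brazilian Portuguese")
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```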

The researchers explored various dataset sizes, ranging from 1k to over 200k translation segments, to assess their impact on translation quality and identify the most cost-efficient approaches, noting that “increasing the fine-tuning data requires dedicating more resources and time.”

They fine-tuned separate models for each training dataset and evaluated their performance based on human evaluations (post-editing data and questionnaires) and automatic metrics (BLEU, chrF++, TER, and COMET). The results were compared to those from the baseline Llama 3 8B Instruct model and GPT-3.5.
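
For reference, those automatic metrics can be computed with standard open-source tooling. The sketch below assumes the sacrebleu and unbabel-comet Python packages; the specific COMET checkpoint (Unbabel/wmt22-comet-da) is an assumption, as the article does not name the one used in the study:

```python
# Sketch of the automatic evaluation: BLEU, chrF++, and TER via sacrebleu,
# COMET via unbabel-comet. The COMET checkpoint name is an assumption.
import sacrebleu
from comet import download_model, load_from_checkpoint

sources    = ["Click Save to apply your changes."]
hypotheses = ["Clique em Salvar para aplicar as alterações."]
references = ["Clique em Salvar para aplicar as suas alterações."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)  # chrF++
ter  = sacrebleu.corpus_ter(hypotheses, [references])

comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet = comet_model.predict(
    [{"src": s, "mt": h, "ref": r}
     for s, h, r in zip(sources, hypotheses, references)],
    batch_size=8, gpus=0,
)

print(f"BLEU {bleu.score:.1f}  chrF++ {chrf.score:.1f}  "
      f"TER {ter.score:.1f}  COMET {comet.system_score:.3f}")
```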

“Extremely Promising” Results

They found that while smaller datasets (1k and 2k segments) underperformed compared to the baseline, performance improved and surpassed the baseline once dataset sizes reached 5k segments, with further improvements as the dataset size increased.

Although GPT-3.5 outperformed the fine-tuned Llama 3 8B Instruct model in BLEU and chrF++ for German and Finnish, models trained on the largest datasets (100k+ segments) often surpassed GPT-3.5 in other languages and metrics.

This highlights “the value of fine-tuning LLMs with in-house TM data, particularly for specialized, domain-specific translations,” according to the researchers.

Additionally, they highlighted that creating custom models through fine-tuning “small business friendly models” like Llama 3 8B Instruct can be a cost-effective way for LSPs to leverage their existing translation resources, without the need for larger, general-purpose models like GPT-3.5.

“This is an […] approach that organizations could be pursuing in order to make the most out of their access to TMs and LLMs for MT in order to obtain the best possible return on investment when leveraging their previously human-translated material,” the researchers noted.

Inacio Vieira, NLP Engineer at Alpha CRC and DCU researcher, described the results as “extremely promising” in a LinkedIn post. “This approach presents a significant advantage for LSPs looking to train cost-effective translation models that can be run locally, ensuring both efficiency and the protection of their proprietary content,” he added.

Highest Return on Investment

Talking to Slator, Vieira highlighted that the highest return on investment for LSPs lies in low-resource languages. “Low resource languages seem to be perfectly placed to benefit from this method,” he said.

Notably, Korean — a low-resource language included in the study — demonstrated significant quality improvements, surpassing even high-resource languages. The researchers reported a 130% increase in the COMET score from the baseline to the 100k+ dataset, while the average increase for other target languages was only 46%.

Looking ahead, Vieira shared that future work will focus on testing Meta’s new Llama 3.1 model and training multilingual models. They also plan to implement a Retrieval-Augmented Generation (RAG) TM lookup system that would allow the model to leverage fuzzy matches, enhancing its ability to generate accurate translations even when exact matches are not available.
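
The article does not detail how that lookup would work. As a rough illustration only, a fuzzy-match retrieval step could score an incoming source segment against the TM and prepend the closest pairs to the translation prompt. The sketch below uses the rapidfuzz package; the prompt wording and the build_prompt helper are hypothetical, not the researchers' system:

```python
# Minimal sketch of a fuzzy-match TM lookup for RAG-style prompting,
# assuming the rapidfuzz package. Prompt wording is hypothetical.
from rapidfuzz import fuzz, process

tm_sources = ["Click Save to apply your changes.",
              "Click Cancel to discard your changes."]
tm_targets = ["Clique em Salvar para aplicar as alterações.",
              "Clique em Cancelar para descartar as alterações."]

def build_prompt(source: str, top_k: int = 2, cutoff: float = 60.0) -> str:
    """Retrieve the closest TM segments and prepend them as examples."""
    matches = process.extract(source, tm_sources, scorer=fuzz.ratio,
                              limit=top_k, score_cutoff=cutoff)
    examples = "\n".join(
        f"Source: {tm_sources[idx]}\nTranslation: {tm_targets[idx]}"
        for _, _, idx in matches)
    return (f"Use these similar translations as reference:\n{examples}\n\n"
            f"Translate into Brazilian Portuguese:\n{source}")

print(build_prompt("Click Save to confirm your changes."))
```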

Vieira also mentioned that Alpha CRC is already exploring the option of offering bespoke LLM fine-tuning as a service to its clients.

Authors: Inacio Vieira, Will Allred, Seamus Lankford, Sheila Castilho, Andy Way


