As large language models (LLMs) continue to transform translation workflows, a new study underscores the ongoing importance of conventional, domain-specific machine translation (MT) models.
While recognizing the impact of LLMs on translation processes, the researchers emphasize the need for careful evaluation of workflows to ensure optimal outcomes.
Previous research has shown that MT systems often outperform LLMs in specialized domains like medical translation. Building on this, the latest study expands the scope by comparing open-source LLMs with task-oriented MT models, offering new insights into why MT models are still important for achieving high-quality translations in domain-specific contexts.
In their December 8, 2024 paper, researchers Aman Kassahun Wassie, Mahdi Molaei, and Yasmin Moslem compared the performance of open-source LLMs like Mistral and Llama with the multilingual encoder-decoder MT model NLLB-200 3.3B across four language pairs: English-to-French, English-to-Portuguese, English-to-Swahili, and Swahili-to-English.
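NLLB-200 is openly available through Hugging Face, so a setup like the study's can be reproduced in a few lines. Below is a minimal sketch, assuming the public facebook/nllb-200-3.3B checkpoint; the example sentence and generation settings are illustrative, not the paper's.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Public NLLB-200 3.3B checkpoint; language codes follow the NLLB scheme.
model_name = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "The patient was prescribed 200 mg of ibuprofen."  # illustrative input
inputs = tokenizer(text, return_tensors="pt")

# NLLB expects the target-language code as the forced first decoder token.
output_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_new_tokens=128,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```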
They found that task-oriented MT models consistently outperformed general-purpose LLMs in specialized translation tasks, concluding that “task-oriented encoder-decoder MT models remain a core component in high-quality domain-specific translation workflows.”
According to the researchers, NLLB-200 3.3B surpassed LLMs in three of the four language pairs, with the exception of English-to-French translation, where the Llama-3 8B model either matched or slightly outperformed NLLB-200 3.3B in zero-shot settings. For English-to-Portuguese, English-to-Swahili, and Swahili-to-English translations, NLLB-200 3.3B delivered significantly better results.
“Our findings highlight the ongoing need for specialized MT models to achieve higher-quality domain-specific translation,” the researchers noted.
Talking to Slator, they clarified that their findings do not advocate against the use of LLMs. They noted that several businesses have adopted self-hosted, open-source LLMs for data privacy and security reasons. “However, in the rush to adopt LLMs, driven by media hype and promotional announcements, it is not clear what workflows should be followed,” they explained.
“Our study encourages thorough evaluation before completely discarding conventional, domain-specific MT models in favor of LLMs,” they said, noting that this is critical in highly regulated fields such as healthcare and law.
Diminishing Returns
The researchers also explored fine-tuning both LLMs and MT models. While fine-tuning improved the performance of LLMs like Llama-3 8B and Mistral 7B, these models still lagged behind the fine-tuned NLLB-200 3.3B. Fine-tuning NLLB-200 on medium-sized domain-specific datasets consistently produced higher scores across the BLEU, chrF++, and COMET metrics.
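For readers unfamiliar with these metrics, here is a hedged sketch of how BLEU, chrF++, and COMET are typically computed with the sacrebleu and unbabel-comet packages. The COMET checkpoint below is the public WMT22 default, not necessarily the one used in the study, and the sentences are illustrative.

```python
from sacrebleu.metrics import BLEU, CHRF
from comet import download_model, load_from_checkpoint

# Illustrative one-sentence "corpus"; real evaluation uses full test sets.
sources = ["The patient was prescribed 200 mg of ibuprofen."]
hypotheses = ["Le patient a reçu 200 mg d'ibuprofène."]
references = [["Le patient s'est vu prescrire 200 mg d'ibuprofène."]]

print("BLEU:  ", BLEU().corpus_score(hypotheses, references).score)
# word_order=2 is what turns chrF into chrF++.
print("chrF++:", CHRF(word_order=2).corpus_score(hypotheses, references).score)

# COMET is a learned metric and also conditions on the source sentence.
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references[0])]
print("COMET: ", comet.predict(data, batch_size=8, gpus=0).system_score)
```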
“Domain-specific translation, e.g., in the medical domain, is challenging for general-purpose open-source LLMs,” the researchers said.
They further noted that fine-tuning task-oriented MT models, such as NLLB-200 3.3B, can be highly effective when domain-specific datasets are available. In such cases, relying on LLMs with around 8B parameters often leads to “diminishing returns,” according to the researchers.
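As a rough illustration of what fine-tuning NLLB-200 on in-domain data involves, here is a sketch using Hugging Face's Seq2SeqTrainer. The dataset name, column names, and hyperparameters are all hypothetical placeholders, not the study's configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(
    model_name, src_lang="eng_Latn", tgt_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical parallel corpus with "en" and "fr" text columns.
dataset = load_dataset("my-org/medical-parallel-corpus")

def preprocess(batch):
    # text_target tokenizes the reference with the target-language settings.
    return tokenizer(batch["en"], text_target=batch["fr"],
                     truncation=True, max_length=256)

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=["en", "fr"])

args = Seq2SeqTrainingArguments(
    output_dir="nllb-200-3.3B-medical",
    per_device_train_batch_size=4,   # illustrative; a 3.3B model needs ample VRAM
    learning_rate=3e-5,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```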
In the absence of domain-specific datasets, the researchers suggested that one-shot translation with decoder-only models, particularly larger LLMs like Mixtral 8x7B, Llama-3 70B, and Llama-3.1 405B, could be considered.
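One-shot translation here simply means prepending a single demonstration pair to the prompt. Below is a minimal sketch with a decoder-only model; the prompt template and example sentences are illustrative, and Mixtral 8x7B requires substantial GPU memory to run.

```python
from transformers import pipeline

# Any of the larger decoder-only models mentioned above could be used here.
generator = pipeline("text-generation",
                     model="mistralai/Mixtral-8x7B-Instruct-v0.1")

# One demonstration pair, then the sentence to translate.
prompt = (
    "English: The tablet should be taken twice daily with food.\n"
    "French: Le comprimé doit être pris deux fois par jour avec de la nourriture.\n"
    "English: Store the vaccine between 2 and 8 degrees Celsius.\n"
    "French:"
)
out = generator(prompt, max_new_tokens=64, do_sample=False,
                return_full_text=False)
# Keep only the first generated line, before the model continues the pattern.
print(out[0]["generated_text"].strip().split("\n")[0])
```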
However, they pointed out that these larger models are less efficient for real-time production use due to their high computational demands. “Deploying Llama-3.1 405B in production can be challenging and inefficient,” the researchers noted, emphasizing the need for more scalable solutions.
Moreover, they urged exploring advanced workflows, such as selecting the best translation based on domain-specific quality estimation, and distilling knowledge from robust, larger LLMs into more efficient, high-quality models.
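As one illustration of the first idea, a reference-free quality estimation model can score candidate translations from several systems and pick the highest-scoring one. The sketch below assumes the public CometKiwi checkpoint (which requires accepting its license on Hugging Face); the paper does not prescribe a specific QE model.

```python
from comet import download_model, load_from_checkpoint

# Reference-free QE model: scores a translation from source + hypothesis only.
qe = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

source = "The patient was prescribed 200 mg of ibuprofen."
candidates = [
    "Le patient a reçu 200 mg d'ibuprofène.",         # e.g., from NLLB-200
    "On a prescrit au patient 200 mg d'ibuprofène.",  # e.g., from an LLM
]

data = [{"src": source, "mt": mt} for mt in candidates]
scores = qe.predict(data, batch_size=8, gpus=0).scores
best = candidates[max(range(len(candidates)), key=lambda i: scores[i])]
print("Selected:", best)
```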