In an October 4, 2024 paper, researchers from Johns Hopkins University and Microsoft introduced X-ALMA, a large language model (LLM)-based multilingual translation model that delivers “top-tier performance” across 50 languages, regardless of resource availability.
While many multilingual LLMs attempt to support hundreds of languages, they often struggle to maintain quality, especially for mid- and low-resource languages, where “their performance […] falls short of practical application expectations,” the researchers explained. They also emphasized that this leads to “imbalanced performance heavily skewed in favor of high-resource languages.”
Even for high-resource languages, quality tends to decline when models are trained on too many languages — a problem known as the ‘curse of multilinguality’. As the researchers pointed out, in current state-of-the-art massively multilingual models, “overall quality decreases as the number of supported languages increases.”
X-ALMA takes a different approach by focusing on a set of 50 diverse languages, rather than attempting to scale to hundreds. “We prioritize quality over scaling the number of languages, with a focus on multilingual machine translation tasks,” the researchers said.
Multilingual models are usually heavily skewed in favor of high-resource languages.
We change this with X-ALMA: an LLM-based translator committed to ensuring top-tier performance across 50 diverse languages, regardless of their resource levels!
Paper: https://t.co/O4M5LDGdAB
— Haoran Xu (@fe1ixxu) October 7, 2024
Building on ALMA-R, previously recognized as “one of the top-performing translation models built on LLMs, comparable to WMT winners and GPT-4-turbo,” X-ALMA extends support to an additional 44 languages.
A core innovation of X-ALMA is its ‘plug-and-play’ architecture, which minimizes negative language interference through language-specific modules. These modules are tailored to handle specific groups of languages. They can be activated individually — saving memory and computational power — or combined using a mixture-of-experts approach, allowing the model to adapt flexibly to different linguistic needs while maintaining high translation quality.
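For readers who want a concrete picture of how language-specific modules might plug into a shared backbone, here is a minimal, hypothetical sketch in PyTorch. The class names, adapter design, and language grouping are illustrative assumptions, not the authors’ implementation.

```python
# Illustrative sketch (not the authors' code): a shared base layer plus
# language-group-specific adapter modules that can be plugged in per request.
import torch
import torch.nn as nn

class LanguageGroupAdapter(nn.Module):
    """A small bottleneck adapter serving one group of related languages."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual adapter: base representation plus a group-specific correction.
        return hidden + self.up(torch.relu(self.down(hidden)))

class PlugAndPlayLayer(nn.Module):
    """Shared layer whose output is refined only by the active group's adapter."""
    def __init__(self, d_model: int, groups):
        super().__init__()
        self.shared = nn.Linear(d_model, d_model)
        self.adapters = nn.ModuleDict({g: LanguageGroupAdapter(d_model) for g in groups})

    def forward(self, hidden: torch.Tensor, group: str) -> torch.Tensor:
        # Only the requested group's adapter runs, so inactive groups cost no compute.
        return self.adapters[group](torch.relu(self.shared(hidden)))

# Hypothetical grouping of languages into a few families.
layer = PlugAndPlayLayer(d_model=512, groups=["germanic", "slavic", "cjk"])
x = torch.randn(2, 16, 512)        # (batch, tokens, hidden size)
out = layer(x, group="slavic")     # route a Slavic-language request
print(out.shape)                   # torch.Size([2, 16, 512])
```

In this toy setup, activating a single adapter mirrors the memory-saving “plug-and-play” mode described above, while running several adapters and mixing their outputs would correspond to the mixture-of-experts option.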
To ensure top-tier performance, X-ALMA underwent a rigorous training process consisting of three pre-training stages and two post-training stages.
In the pre-training stages, the base model is trained on monolingual data, and language-specific modules are fine-tuned to their respective languages, ensuring they handle both high- and low-resource languages effectively. During the post-training stages, the model is further refined using high-quality translation data, followed by an optimization process called Adaptive-Rejection Preference Optimization (ARPO).
ARPO is an optimization method designed to tackle the ‘over-rejection’ issue that arises in preference-based training for machine translation. The researchers describe this as “a phenomenon where the writing style of the translation outputs is forced away from the preferred data distribution”. In simple terms, when the preferred and rejected translations in a training pair are very similar, standard preference optimization pushes the model away from both, including the better one. ARPO adjusts the rejection strength based on how similar the two translations are, ensuring that the model generates translations closer to the preferred outputs.
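The exact ARPO loss is defined in the paper; the toy snippet below only illustrates the underlying intuition, namely that the penalty pushing the model away from a rejected translation is scaled down when the rejected and preferred translations are nearly identical. The function name, the similarity weighting, and the DPO-style form are assumptions made for illustration, not the paper’s formula.

```python
# Toy illustration of adaptive rejection (NOT the paper's ARPO loss): the term
# pushing the model away from the rejected translation is weakened when the
# chosen and rejected translations are nearly identical.
import torch
import torch.nn.functional as F

def adaptive_rejection_loss(logp_chosen, logp_rejected, similarity, beta=0.1):
    """logp_*: policy-vs-reference log-ratio terms for chosen/rejected outputs.
    similarity: value in [0, 1]; 1.0 means the two translations are almost identical.
    """
    # Hypothetical weighting: high similarity -> weak rejection, so the model is
    # not pushed away from the preferred style along with the rejected output.
    rejection_weight = 1.0 - similarity
    margin = logp_chosen - rejection_weight * logp_rejected
    return -F.logsigmoid(beta * margin).mean()

# Example: two preference pairs, the first with near-duplicate translations.
chosen = torch.tensor([-1.0, -1.2])
rejected = torch.tensor([-1.1, -3.0])
sim = torch.tensor([0.95, 0.20])
print(adaptive_rejection_loss(chosen, rejected, sim))
```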
When evaluated on the FLORES-200 and WMT’23 datasets, X-ALMA consistently outperformed other massively multilingual models, including NLLB-3.3B, LLaMAX3-Alpaca-8B, and Aya-101, across all language pairs in both directions (into and from English), as measured by the COMET-22 metric. It also surpassed high-resource language models like Aya-23-8B and Aya-23-35B.
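COMET-22 is an open-source, reference-based quality metric. Scoring a system’s output with it looks roughly like the following, assuming the unbabel-comet Python package; this is a generic sketch, not the researchers’ evaluation script.

```python
# Sketch of scoring translations with COMET-22, assuming the open-source
# `unbabel-comet` package (pip install unbabel-comet).
from comet import download_model, load_from_checkpoint

# Download and load the COMET-22 reference-based metric.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each item pairs a source sentence, a system translation, and a reference.
data = [
    {"src": "Der Bericht wurde gestern veröffentlicht.",
     "mt": "The report was published yesterday.",
     "ref": "The report was released yesterday."},
]

# gpus=0 runs on CPU; higher scores indicate better translations.
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)
```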
“We tackled the challenge of achieving high translation quality while scaling to a large number of languages, a limitation seen in many state-of-the-art multilingual models,” the researchers noted.
The researchers have made the code and model checkpoints publicly available, contributing to the broader open-source community. The code is available on GitHub, and the models and datasets can be accessed on Hugging Face.
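Loading one of the released checkpoints should follow the standard Hugging Face transformers workflow, roughly as sketched below. The repository id and prompt format are assumptions; the project’s GitHub README documents the exact model names and usage.

```python
# Sketch of loading a released checkpoint with the standard transformers API;
# the model id and prompt format are assumptions -- check the X-ALMA README.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "haoranxu/X-ALMA-13B-Group1"  # hypothetical per-group checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# ALMA-style translation prompt (format assumed; verify against the model card).
prompt = "Translate this from German to English:\nGerman: Guten Morgen!\nEnglish:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```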
Authors: Haoran Xu, Kenton Murray, Philipp Koehn, Hieu Hoang, Akiko Eriguchi, and Huda Khayrallah