
In an October 4, 2024 paper, researchers from Johns Hopkins University and Microsoft introduced X-ALMA, a large language model (LLM)-based multilingual translation model that delivers “top-tier performance” across 50 languages, regardless of resource availability.

While many multilingual LLMs attempt to support hundreds of languages, they often struggle to maintain quality, especially for mid- and low-resource languages, where “their performance […] falls short of practical application expectations,” the researchers explained. They also emphasized that this leads to “imbalanced performance heavily skewed in favor of high-resource languages.”

Even for high-resource languages, quality tends to decline when models are trained on too many languages — a problem known as the ‘curse of multilinguality’. As the researchers pointed out, in current state-of-the-art massively multilingual models, “overall quality decreases as the number of supported languages increases.”

X-ALMA takes a different approach by focusing on a set of 50 diverse languages, rather than attempting to scale to hundreds. “We prioritize quality over scaling the number of languages, with a focus on multilingual machine translation tasks,” the researchers said.

Building on ALMA-R, previously recognized as “one of the top-performing translation models built on LLMs, comparable to WMT winners and GPT-4-turbo,” X-ALMA extends support to an additional 44 languages.

A core innovation of X-ALMA is its ‘plug-and-play’ architecture, which minimizes negative language interference through language-specific modules. These modules are tailored to handle specific groups of languages. They can be activated individually — saving memory and computational power — or combined using a mixture-of-experts approach, allowing the model to adapt flexibly to different linguistic needs while maintaining high translation quality.
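As a rough illustration of that plug-and-play idea, the sketch below pairs a shared base model with per-language-group modules that can be attached one at a time. The class and parameter names (LanguageGroupAdapter, PlugAndPlayTranslator, bottleneck_dim) are illustrative assumptions, not X-ALMA’s actual implementation.

```python
# Illustrative sketch only: a shared base model plus per-language-group modules.
# Names and module design are assumptions, not X-ALMA's published architecture.
import torch
import torch.nn as nn

class LanguageGroupAdapter(nn.Module):
    """A small bottleneck module trained for one group of related languages."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual adapter: shared representation plus a group-specific correction.
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

class PlugAndPlayTranslator(nn.Module):
    """Shared base model with language-group modules that can be plugged in."""
    def __init__(self, base_model: nn.Module, hidden_dim: int, groups: list[str]):
        super().__init__()
        self.base_model = base_model
        # One module per language group; only the module needed for the current
        # language pair has to be active, saving memory and compute.
        self.adapters = nn.ModuleDict(
            {g: LanguageGroupAdapter(hidden_dim) for g in groups}
        )

    def forward(self, input_ids: torch.Tensor, group: str) -> torch.Tensor:
        hidden = self.base_model(input_ids)
        # Activate only the module for the target language's group.
        return self.adapters[group](hidden)
```

In the same spirit, the paper’s mixture-of-experts option would route between several such modules rather than selecting a single one.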


To ensure top-tier performance, X-ALMA underwent a rigorous training process consisting of three pre-training stages and two post-training stages.

In the pre-training stages, the base model is trained on monolingual data, and language-specific modules are fine-tuned to their respective languages, ensuring they handle both high- and low-resource languages effectively. During the post-training stages, the model is further refined using high-quality translation data, followed by an optimization process called Adaptive-Rejection Preference Optimization (ARPO).

ARPO is an optimization method designed to tackle the ‘over-rejection’ issue that arises during preference-based training. The researchers describe this as “a phenomenon where the writing style of the translation outputs is forced away from the preferred data distribution.” In simple terms, when the preferred and rejected translations in a training pair are very similar, standard preference optimization pushes the model away from both, even though one is clearly better. ARPO adjusts the rejection strength based on how close the two translations are, ensuring that the model generates translations closer to the preferred outputs.
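The snippet below is a simplified, hypothetical rendering of that adaptive-rejection idea as a preference loss; the function name, the similarity signal, and the exact weighting are assumptions for illustration, not the paper’s ARPO objective.

```python
# Simplified illustration of adaptive rejection in a preference loss.
# NOT the paper's exact ARPO formula; weighting and names are assumptions.
import torch
import torch.nn.functional as F

def adaptive_rejection_loss(
    logp_chosen: torch.Tensor,    # model log-prob of the preferred translation
    logp_rejected: torch.Tensor,  # model log-prob of the dispreferred translation
    similarity: torch.Tensor,     # similarity in [0, 1] between the two translations
    beta: float = 0.1,
) -> torch.Tensor:
    # When the pair is nearly identical, shrink the push away from the rejected
    # translation instead of rejecting both writing styles wholesale.
    rejection_weight = 1.0 - similarity
    preference = -F.logsigmoid(beta * (logp_chosen - logp_rejected)) * rejection_weight
    # Always keep the model anchored to the preferred output.
    anchor = -logp_chosen
    return (preference + anchor).mean()
```

The design intent mirrors the paper’s description: near-duplicate pairs contribute little rejection pressure, while the model is still pulled toward the preferred translations.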


When evaluated on the FLORES-200 and WMT’23 datasets, X-ALMA consistently outperformed other massively multilingual models, including NLLB-3.3B, LLaMAX3-Alpaca-8B, and Aya-101, across all language pairs in both directions (into and from English), as measured by the COMET-22 metric. It also surpassed high-resource language models like Aya-23-8B and Aya-23-35B.
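For readers who want to run that kind of scoring themselves, the snippet below shows how translation output is typically evaluated with the COMET-22 metric using the open-source unbabel-comet package; the example sentences are placeholders, and the paper’s exact evaluation setup may differ.

```python
# Scoring machine translation output with COMET-22
# (requires: pip install unbabel-comet). Example data is illustrative only.
from comet import download_model, load_from_checkpoint

# Download and load the reference-based COMET-22 checkpoint.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Der Vertrag wurde gestern unterzeichnet.",  # source sentence
        "mt": "The contract was signed yesterday.",         # system translation
        "ref": "The agreement was signed yesterday.",       # human reference
    }
]

# system_score is the corpus-level average; scores holds per-segment values.
output = model.predict(data, batch_size=8, gpus=0)  # set gpus=1 if a GPU is available
print(output.system_score, output.scores)
```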

“We tackled the challenge of achieving high translation quality while scaling to a large number of languages, a limitation seen in many state-of-the-art multilingual models,” the researchers noted.

The researchers have made the code and model checkpoints publicly available, contributing to the broader open-source community. The code is available on GitHub, and the models and datasets can be accessed on Hugging Face.

Authors: Haoran Xu, Kenton Murray, Philipp Koehn, Hieu Hoang, Akiko Eriguchi, and Huda Khayrallah


