
On September 24, 2024, researchers from Unbabel, the University of Edinburgh, CentraleSupélec, and other partners introduced the EuroLLM project and released its first models — EuroLLM-1.7B and EuroLLM-1.7B-Instruct — as part of an open-weight, open-source suite of large language models (LLMs).

In a post on X, Pedro Martins, Senior AI Research Scientist at Unbabel, highlighted that the models can “understand and generate text in all EU languages.” Specifically, the models support the 24 official EU languages plus 11 additional languages, including Arabic, Russian, Turkish, and Chinese. Manuel Faysse, Research Scientist at Illuin Technology, noted in another post on X that EuroLLM has “a strong focus on multilinguality.”

The researchers explained that while models like OpenAI’s GPT-4 and Meta’s LLaMA have brought significant advancements, they remain largely focused on English and a few high-resource languages, leaving many languages underserved. To address this, the EuroLLM team aims to create “a suite of LLMs capable of understanding and generating text in all European Union languages […] as well as some additional relevant languages.”

EuroLLM-1.7B was trained on 4 trillion tokens split across the supported languages and several data sources, including web data, parallel data (en-xx and xx-en), and high-quality datasets such as Wikipedia and arXiv.
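For readers unfamiliar with the en-xx/xx-en notation, it simply denotes parallel corpora used in both translation directions. A minimal Python sketch of how one aligned corpus yields training records for both directions follows; the sentence pair is made up for illustration, and EuroLLM’s actual preprocessing pipeline is not described in the release:

```python
# Minimal sketch: deriving both translation directions (en-xx and xx-en)
# from a single aligned corpus. The sentence pair is invented for
# illustration; EuroLLM's actual preprocessing is not public here.
aligned_en_de = [
    ("The committee approved the proposal.",
     "Der Ausschuss billigte den Vorschlag."),
]

def both_directions(pairs, xx="de"):
    """Yield (src_lang, tgt_lang, src_text, tgt_text) twice per pair."""
    for en_text, xx_text in pairs:
        yield ("en", xx, en_text, xx_text)  # en-xx direction
        yield (xx, "en", xx_text, en_text)  # xx-en direction

for record in both_directions(aligned_en_de):
    print(record)
```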

The EuroLLM-1.7B-Instruct model was further instruction-tuned on EuroBlocks, an instruction-tuning dataset designed for general instruction-following and machine translation (MT).
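The announcement does not specify the EuroBlocks schema, but a typical instruction-tuning record for an MT task, shown here purely as a hypothetical illustration in chat-message form, might look like this:

```python
# Hypothetical EuroBlocks-style record for a machine-translation task.
# The real dataset schema is not specified in the announcement.
mt_example = {
    "messages": [
        {"role": "user",
         "content": "Translate the following sentence from English to "
                    "French: The meeting starts at noon."},
        {"role": "assistant",
         "content": "La réunion commence à midi."},
    ]
}
print(mt_example["messages"][0]["content"])
```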


The team evaluated the EuroLLM-1.7B-Instruct model on several MT benchmarks, including FLORES-200, WMT-23, and WMT-24, and compared it with Gemma-2B and Gemma-7B, both instruction-tuned on EuroBlocks. They used COMET-22 to evaluate the models’ MT performance.
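COMET-22 is a neural MT evaluation metric released by Unbabel. A minimal scoring sketch using the open-source unbabel-comet package and its public Unbabel/wmt22-comet-da checkpoint could look like this; the source, hypothesis, and reference segments are invented examples:

```python
# pip install unbabel-comet
# Minimal sketch: scoring translations with COMET-22. The segments
# below are invented; plug in real system outputs in practice.
from comet import download_model, load_from_checkpoint

checkpoint_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(checkpoint_path)

data = [
    {
        "src": "Der Ausschuss billigte den Vorschlag.",
        "mt": "The committee approved the proposal.",
        "ref": "The committee approved the proposal.",
    },
]

# Returns per-segment scores and a corpus-level system score.
output = comet_model.predict(data, batch_size=8, gpus=0)
print(output.scores, output.system_score)
```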

Despite its small size, EuroLLM-1.7B-Instruct outperformed Gemma-2B-Instruct on all language pairs and datasets and remained competitive with Gemma-7B-Instruct.

Martins, in another X post, emphasized, “EuroLLM-1.7B excels at machine translation.” Faysse added, “For the small size, it really excels on translation tasks, which is super promising once we’ll scale up.”

While the models demonstrate strong translation capabilities, the researchers acknowledged that EuroLLM-1.7B has not yet been fully aligned with human preferences, which means it may occasionally produce problematic outputs, such as hallucinations or inaccurate statements.

Looking ahead, the EuroLLM team plans to scale up the model and improve data quality. Both Martins and Ricardo Rei, Senior Research Scientist at Unbabel, confirmed this in posts on X, with Rei teasing: “New models are coming (9B and 22B) as well as strong instruct models! Stay tuned!”

The EuroLLM models are now available on Hugging Face.
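For those who want to try them, a minimal inference sketch with Hugging Face transformers follows. It assumes the utter-project/EuroLLM-1.7B-Instruct repository id and that the tokenizer ships a chat template, so check the model card for exact usage:

```python
# Minimal sketch: generating a translation with EuroLLM-1.7B-Instruct.
# Assumes the utter-project/EuroLLM-1.7B-Instruct repo id and a chat
# template shipped with the tokenizer; see the Hugging Face model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "utter-project/EuroLLM-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {"role": "user",
     "content": "Translate to Portuguese: The weather is nice today."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:],
                       skip_special_tokens=True))
```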

Authors: Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, and André F. T. Martins.
