On February 6, 2025, Meta unveiled BOUQuET, a comprehensive dataset and benchmarking initiative aimed at improving multilingual machine translation (MT) evaluation. 

This development aligns with Meta’s ongoing efforts to source diverse AI translation data through collaborative partnerships.

The researchers behind BOUQuET noted that existing datasets and benchmarks often fall short due to their English-centric focus, narrow range of registers, reliance on automated data extraction, and limited language coverage. These constraints make it hard to evaluate translation quality fairly across diverse linguistic contexts.

BOUQuET addresses these gaps by moving away from English-centric benchmarks. Its content originates in seven non-English languages (French, German, Hindi, Indonesian, Mandarin Chinese, Russian, and Spanish) and is then translated into English.

According to the research team, “BOUQuET is specially designed to avoid contamination and be multicentric, so as to enforce representation of multilingual language features.” The design is intended to support a more comprehensive evaluation of AI translation models across different linguistic structures and cultural contexts.

The dataset spans eight distinct domains, including fiction, conversation, social media posts and comments, websites, tutorials, and opinion pieces. It also captures various registers, from formal discourse to colloquial speech. 

Unlike many benchmarks that rely on automated data collection, BOUQuET comprises manually created and meticulously reviewed paragraphs. Source-BOUQuET is developed by proficient speakers of the included languages, following detailed linguistic guidelines.

Meta encourages contributions from global language communities, aiming to expand the dataset to include translations “into any written language.” Marta R. Costa-jussà, a Research Scientist at Meta, wrote in a post on X, “Let’s make MT available for any written language!”

The researchers emphasized that “this ambition can only be achieved with the support of the community.”

Meta plans to expand BOUQuET with additional languages, starting with Egyptian Arabic, and refine evaluation methodologies to better capture domain-specific translation challenges.

Authors: Omnilingual MT Team, Pierre Andrews, Mikel Artetxe, Mariano Coria Meglioli, Marta R. Costa-jussà, Joe Chuang, David Dale, Cynthia Gao, Jean Maillard, Alex Mourachko, Christophe Ropers, Safiyyah Saleem, Eduardo Sánchez, Ioannis Tsiamas, Arina Turkatenko, Albert Ventayol-Boada, and Shireen Yates


