On September 9, 2023, Google released a 10.7 billion parameter multilingual machine translation (MT) model trained on a new dataset called MADLAD-400 (Multilingual And Document-Level Large Audited Dataset).
MADLAD-400 is a manually audited, general domain monolingual document-level dataset covering 419 languages. The objective behind creating MADLAD-400 was to provide valuable training data for multilingual natural language processing (NLP) tasks like MT and language modeling.
The motivation behind MADLAD-400 was to address linguistic diversity: a substantial portion of the world’s population speaks languages that are not adequately covered by mainstream models, whether within Google or in the wider research community. “Most publicly available general-domain multilingual corpora contain 100-200 languages, with some datasets containing more languages in specific domains,” explained the authors.
MADLAD-400 seeks to bridge this gap by offering data for a much larger and more diverse set of languages. “Our expectation is that releasing MADLAD-400 will foster progress on the language research, especially on medium and low resource languages,” said the authors.
MADLAD-400 itself is an extensive dataset, comprising 4.0 billion documents, or 100 billion sentences and 2.8 trillion tokens, across the 419 languages. Data availability varies considerably from language to language: the median language in the dataset contains 1.7 thousand documents, amounting to 73 thousand sentences and 1.2 million tokens.
Excited to announce MADLAD-400 – a 2.8T token web-domain dataset that covers 419 languages(!).
Arxiv: https://t.co/Y48Bsw952P
Github: https://t.co/ANNVWvfAk2 — Sneha Kudugunta (@snehaark) September 12, 2023
To construct this dataset, the authors employed a two-step process. First, they used a document-level “Language Identification” model to identify and annotate data from CommonCrawl, a web-scale repository. Recognizing the noisy nature of web-scale corpora, they performed manual inspections and preprocessing to enhance data quality.
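The paper’s own LangID system is not publicly available, but the annotate-and-filter step it performs can be sketched with an off-the-shelf language identifier. The snippet below is a minimal illustration only, assuming fastText’s public lid.176.bin model as a stand-in for the authors’ document-level LangID model; the confidence threshold and helper function are hypothetical.

```python
# Minimal sketch of LangID-based annotation and filtering of web documents.
# fastText's lid.176.bin is used here as a stand-in for the paper's
# (non-public) document-level LangID model; the threshold is illustrative.
import fasttext

langid = fasttext.load_model("lid.176.bin")  # pretrained language-ID model

def annotate(documents, min_confidence=0.7):
    """Attach a predicted language to each document, dropping low-confidence ones."""
    kept = []
    for doc in documents:
        text = doc.replace("\n", " ").strip()
        labels, probs = langid.predict(text, k=1)
        lang = labels[0].replace("__label__", "")
        if probs[0] >= min_confidence:
            kept.append({"text": doc, "lang": lang, "confidence": float(probs[0])})
    return kept

docs = ["Ceci est un document en français.", "This is an English web page."]
for d in annotate(docs):
    print(d["lang"], round(d["confidence"], 2))
```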
The authors audited the initial dataset themselves and, in some cases, also engaged native speaker volunteers to assess its quality. Based on these findings, 79 of the initial 498 languages were removed.
Competitive Performance
To validate the efficacy of MADLAD-400, the authors trained and released multilingual MT models of varying sizes, up to 10.7 billion parameters, as well as an 8 billion parameter decoder-only language model. More specifically, they trained a 3 billion parameter, 32-layer model; a 7.2 billion parameter, 48-layer model; and a 10.7 billion parameter, 32-layer model. The MT models were trained not only on MADLAD-400 but also on publicly available parallel data covering 157 languages.
These models were extensively evaluated on diverse multilingual translation evaluation sets, such as WMT, NTREX, Flores-200, and Gatones, using established metrics like SacreBLEU and chrF. Impressively, the 10.7 billion parameter MT model was “competitive with models that are significantly larger,” according to the authors.
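Both metrics are implemented in the open-source sacreBLEU package. The snippet below is a small illustration of corpus-level scoring with made-up sentences, not the authors’ actual evaluation pipeline or their test data.

```python
# Illustrative BLEU and chrF scoring with sacreBLEU; the sentences are
# invented examples, not outputs from the MADLAD-400 models.
import sacrebleu

hypotheses = ["The cat sits on the mat.", "She reads a book in the garden."]
references = [["The cat is sitting on the mat.", "She is reading a book in the garden."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```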
“We trained these models with MADLAD-400 and publicly available data to create baseline models that support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora,” said the authors.
However, the authors noted that these models are intended primarily for research purposes and might not be suitable for domain-specific applications out of the box. Additionally, they have not undergone assessment for production-level use cases.
The baseline models were made available to the research community on GitHub.
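As a rough illustration of how one of these baselines might be queried, the sketch below assumes the checkpoints later made available in Hugging Face format under names such as google/madlad400-3b-mt (the release referenced in the article is the GitHub repository); the “<2xx>” prefix selects the target language.

```python
# Sketch of translating with a released MADLAD-400 MT baseline, assuming the
# Hugging Face-format checkpoint google/madlad400-3b-mt is available.
# The "<2pt>" prefix requests Portuguese as the target language.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "google/madlad400-3b-mt"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids
outputs = model.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```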
Authors: Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat