On February 17, 2025, Google released SMOL (Set of Maximal Overall Leverage), a professionally translated dataset aimed at improving machine translation (MT) for 115 low-resource languages (LRLs).
SMOL consists of two components: SMOLSENT, a collection of 863 English sentences translated into 81 languages, and SMOLDOC, a dataset of 584 English documents translated into 100 languages, covering diverse topics.
According to the researchers, the dataset, which includes 6.1 million translated tokens, is a key resource for expanding multilingual AI capabilities, particularly for languages “for which there exist no previous public resources.”
“Our released SMOL dataset will assist in development and evaluation of machine translation and multilingual AI capabilities for low-resource languages,” they noted.
They explained that most MT datasets focus on high-resource languages, leaving many of the world’s 7,000 languages underserved. SMOL aims to bridge this gap by providing “professionally translated sentence- and document-level data.”
Together with GATITOS, Google’s previously released token-level dataset, SMOL forms a comprehensive three-tiered resource covering tokens, sentences, and full documents.
Unlike many LRL datasets, SMOL includes factuality annotations as well, with the researchers noting that “in addition to translation, we provide factuality ratings and rationales for all documents in SMOLDOC, yielding the first factuality datasets for most of these languages.”
Robust Improvements
To evaluate SMOL’s effectiveness, the researchers fine-tuned Google’s Gemini 2.0 Flash model, demonstrating significant translation quality improvements.
“We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust […] improvements,” they said.
They noted that combining SMOLSENT and SMOLDOC led to notable improvements in translation quality. Adding GATITOS to the fine-tuning mix increased the gains further, with performance surpassing all other zero-shot baselines except Google Translate on the languages it supports.
They also found that few-shot prompting with SMOL-based examples yielded improvements comparable to fine-tuning, offering two effective routes to better LRL translation.
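As a rough illustration of the few-shot route, the sketch below assembles a translation prompt from SMOL-style sentence pairs. The prompt layout, the helper `build_few_shot_prompt`, and the placeholder translations are assumptions for illustration only, not the paper's exact prompting setup.

```python
# Minimal sketch of few-shot prompting for low-resource MT with
# SMOL-style sentence pairs. Prompt layout and example pairs are
# illustrative assumptions, not the paper's exact setup.

def build_few_shot_prompt(pairs, source_sentence, target_language):
    """Assemble a translation prompt from (English, target) example pairs."""
    lines = [f"Translate the following sentences from English to {target_language}."]
    for english, target in pairs:
        lines.append(f"English: {english}")
        lines.append(f"{target_language}: {target}")
    # End with the sentence to translate, leaving the completion open
    # for the model to fill in.
    lines.append(f"English: {source_sentence}")
    lines.append(f"{target_language}:")
    return "\n".join(lines)


# Hypothetical SMOLSENT-style pairs; real pairs would come from the dataset.
examples = [
    ("Good morning.", "<translation 1>"),
    ("Where is the nearest market?", "<translation 2>"),
]
print(build_few_shot_prompt(examples, "How much does this cost?", "N'Ko"))
```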
Open Access
While SMOL marks an important step in making MT more inclusive, the authors acknowledge that “future work on SMOL-like datasets should also focus on non-English source text that is not only maximally authentic in the given language but also covers the topics and concepts most relevant to those languages.”
The dataset is open-source and available on Hugging Face, with the researchers inviting contributions, corrections, and suggestions, particularly from native speakers of the included languages. They also plan to update the repository periodically with corrections or additional translations.
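For readers who want to explore the data, here is a minimal sketch of loading SMOL with the Hugging Face `datasets` library. The dataset id `google/smol` and the config name below are assumptions; the Hugging Face page lists the exact identifiers and available language pairs.

```python
# Sketch of loading one SMOL config from the Hugging Face Hub.
# The dataset id and config name are assumptions; consult the
# Hub page for the exact identifiers.
from datasets import load_dataset

ds = load_dataset("google/smol", "smolsent__en_nqo")  # assumed id/config
print(ds)              # available splits and their sizes
print(ds["train"][0])  # one English sentence with its translation
```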
Authors: Isaac Caswell, Elizabeth Nielsen, Jiaming Luo, Colin Cherry, Geza Kovacs, Hadar Shemtov, Partha Talukdar, Dinesh Tewari, Baba Mamadi Diane, Koulako Moussa Doumbouya, Djibrila Diane, and Solo Farabado Cissé