A new white paper on large language model (LLM) development, produced by Stanford University with The Asia Foundation and the University of Pretoria, discusses best practices for improving the LLM landscape for low-resource languages.
In an April 22, 2025 press release, Stanford’s Human-Centered Artificial Intelligence (HAI) explains the “digital divide” of LLM development, in which major LLMs underperform for languages other than English, especially lower-resource languages. (The underperformance also extends to the understanding of relevant cultural contexts and accessibility in parts of the Global South, described as “technologically under-resourced geographies”.)
Authors Juan N. Pava, Carolina Meinhardt, Daniel Zhang, and Elena Cryst are affiliated with HAI; Haifa Badi Uz Zaman, previously with HAI, is now at The Rockefeller Foundation; and Sang T. Truong and Sanmi Koyejo are with Stanford University. The Asia Foundation’s Toni Friedman and the University of Pretoria’s Vukosi Marivate (a cofounder of Masakhane Research Foundation) also collaborated on the paper.
The group identified data quality and quantity as major causes of this underperformance, namely poor-quality data and a scarcity of language data, particularly labeled data, and made recommendations for stakeholders to support LLM development.
Languages with large speaking populations, such as Vietnamese and Hausa, can still be considered low-resource languages due to a lack of digital resources to support “advanced computational tasks.”
A “lack of sufficient AI literacy, talent, and computing resources” has resulted in most NLP research on Global South languages being conducted in Global North institutions, the paper notes, “where research biases often lead to low-resource language research needs being overlooked.”
The paper identifies several approaches to models for low-resource languages. For speech translation, so-called “massively multilingual models” perform better for individual languages than their monolingual counterparts.
The success of text translation models has been more mixed, a problem compounded by the “curse of multilinguality”: the point at which extending a model to more languages starts to come at the expense of its performance on individual languages, often low-resource ones. Combined with the computational cost of increasing model size, this makes massively multilingual models somewhat impractical and inaccessible for small, under-resourced research teams.
A Fork in the Road
As an alternative to massively multilingual models, the authors explore two different strategies: regional multilingual models and monolingual, monocultural models.
Both strategies draw on the same broad approaches: researchers can either use the architecture of a foundation model (often a BERT-style model) to train a new model from scratch, or fine-tune an off-the-shelf foundation model on one or more low-resource languages.
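As a rough illustration of the second path, the sketch below fine-tunes a multilingual foundation model on a monolingual corpus in a target language using the Hugging Face libraries. The base model, corpus file name, and training settings are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of fine-tuning an off-the-shelf foundation model on a
# low-resource language. Model, corpus file, and hyperparameters are
# placeholders for illustration only.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "xlm-roberta-base"  # a multilingual BERT-style foundation model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical plain-text corpus in the target low-resource language.
corpus = load_dataset("text", data_files={"train": "target_language_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-target-lang",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer),  # masked-LM objective
)
trainer.train()
```

Training from scratch would follow the same outline, except the model is initialized from a fresh configuration rather than from pretrained weights.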
Regional multilingual models tend to be smaller, are frequently developed outside the private sector, and are generally trained on multilingual data from 10-20 languages, usually grouped based on geographical or linguistic proximity.
One example of this kind of model is Southeast Asian Languages in One Network, known as SEA-LION, a project spearheaded by Singapore’s national R&D program, AI Singapore, with support from private companies, such as AWS, Google Research, and IBM. The model now covers 13 high- and low-resource languages prevalent in Southeast Asia: English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao, Javanese, and Sundanese.
Both public and private institutions have worked to develop monolingual and monocultural models, which avoid the “curse of multilinguality” by dedicating an entire model to a target language. These include SwahBERT and UlizaLlama for Swahili; Typhoon for Thai; and IndoBERT for Indonesian.
A joint initiative by Stanford and VNU-HCM University of Technology developed five Vietnamese-focused LLMs by fine-tuning a number of off-the-shelf models with additional Vietnamese data. Despite the scarcity of training datasets and limited computational resources, the researchers were able to train high-performing models.
Still, these methods are not a silver bullet. The authors note that, for languages with limited data available, massively multilingual models sometimes outperform monolingual models fine-tuned from foundation models. The lowest-resource languages may not have enough data to efficiently train a monolingual model. In other words, data scarcity remains a key bottleneck in advancing AI capabilities in Global South countries.
Creating Data from Scratch
To overcome data scarcity, stakeholders from large US-based firms, Global South NLP communities, and governments are focusing their efforts on large-scale data production through machine translation (MT).
One option is translating text from English (or another high-resource language) to the target low-resource language.
This can be done via the “translate-train” approach, in which existing MT models with strong low-resource language translation capabilities produce translated texts that are then used to fine-tune a multilingual model for a given task. (This data is often combined with “real-world” data from the target language.)
Alternatively, via the “translate-test method,” text is translated from a low-resource language into English, after which an English-only model is used for a desired task.
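As a rough sketch of the translate-test route, the example below translates Vietnamese text into English and then applies an English-only classifier; translate-train is noted in a comment. The specific models are illustrative choices, not systems evaluated in the paper.

```python
# Minimal sketch of translate-test: translate low-resource-language text into
# English, then run an existing English-only model on the translation.
# Translate-train reverses the direction: translate English training data into
# the target language, then fine-tune a model on it.
# Model choices below are illustrative assumptions.
from transformers import pipeline

translate = pipeline("translation", model="Helsinki-NLP/opus-mt-vi-en")  # Vietnamese -> English
classify = pipeline("sentiment-analysis")  # default English-only classifier

def translate_test(texts):
    english = [out["translation_text"] for out in translate(texts)]
    return classify(english)

# "This movie is truly wonderful." -> expected positive label
print(translate_test(["Bộ phim này thật tuyệt vời."]))
```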
Both approaches have limitations, particularly due to MT often missing important contextual knowledge and linguistic nuances. This can result in the unnatural language patterns known as “translationese”; frustratingly, these problems are often inconsistent across languages, making them difficult to address systematically.
Another option for producing diverse, high-quality annotated datasets is using different machine learning approaches or engaging native speakers of low-resource languages through crowdsourcing.
A machine learning approach aims to automate or semi-automate the labeling process and streamline it (for instance, by developing a set of labeling rules to automatically annotate certain data). However, the success of this method still depends on the quantity and quality of supplementary resources, such as bilingual datasets, domain experts, and dictionaries.
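A minimal sketch of such a rule-based (weakly supervised) labeling step appears below, using a tiny hand-built sentiment lexicon. The Swahili cue words, examples, and label scheme are illustrative assumptions, not resources described in the paper.

```python
# Minimal sketch of rule-based auto-annotation: a small lexicon labels what it
# can and abstains otherwise, leaving unresolved examples for human annotators.
# Lexicon entries and examples are illustrative only.
from typing import Optional

POSITIVE = {"nzuri", "furaha", "bora"}    # hypothetical Swahili "positive" cues
NEGATIVE = {"mbaya", "huzuni", "hasira"}  # hypothetical Swahili "negative" cues

def label_with_rules(sentence: str) -> Optional[str]:
    tokens = set(sentence.lower().split())
    positive_hits = len(tokens & POSITIVE)
    negative_hits = len(tokens & NEGATIVE)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return None  # abstain: route to human annotators

unlabeled = ["chakula ni nzuri sana", "huduma ilikuwa mbaya"]
print([(sentence, label_with_rules(sentence)) for sentence in unlabeled])
```

In practice, the coverage and accuracy of such rules depend directly on the supplementary resources mentioned above, such as dictionaries and input from domain experts.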
A participatory approach, meanwhile, involves native speakers throughout the AI development lifecycle, rather than just in data labeling. Both Microsoft’s ELLORA and the Google-funded project Vaani saw researchers collaborate with speakers of low-resource languages in India.
‘Bridging the Gap’
The authors close their paper with three “overarching recommendations” to help close the LLM divide.
Strategic investments in AI research and development for low-resource languages can be a game-changer. This could mean subsidizing access to computing and cloud resources; funding research initiatives to improve the availability and quality of low-resource language data; or incentivizing cross-disciplinary work on data and computational limitations.
The paper endorses participatory research as contributing to more inclusive AI development. Direct collaboration with communities that speak low-resource languages could include co-designing datasets and deciding, together, on labeling schemes and evaluation methods. Communities can contribute and even co-own the creation of AI resources.
Lastly, the authors emphasize “equitable data ownership,” which they describe as a global problem touching issues such as consent, copyright, and fair compensation. Incentivizing and supporting “rights-respecting licensing frameworks that facilitate AI development” can help establish fair compensation structures for data contributors, among other benefits.