In a February 26, 2025 paper, researchers from Tsinghua University and the University of Cambridge introduced LoRS-Merging (Low-Rank and Sparse Model Merging), a technique designed to improve multilingual speech recognition and translation without full retraining.
By efficiently merging models trained on different languages or tasks, LoRS-Merging reduces computational costs, minimizes language interference, and improves scalability, addressing key challenges in automatic speech recognition (ASR) and AI speech translation (ST).
LoRS-Merging builds on recent efforts to improve multilingual speech processing without full retraining.
The researchers explained that traditional multilingual models such as OpenAI’s Whisper require extensive joint training across multiple languages, which is expensive and can lead to performance trade-offs.
LoRS-Merging removes that requirement by merging models trained on individual languages or tasks, preserving each model's essential structure while filtering out redundant parameters. Rather than retraining an entire model for each new language, the method selectively retains the information that matters for each language or task and discards the rest, improving efficiency.
The approach combines low-rank pruning with sparse pruning: the low-rank component retains the effective parts of the model's structure, while the sparse component removes redundant individual parameters, reducing negative transfer between languages.
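As a rough, hypothetical illustration of that decomposition (not the authors' released code), the NumPy sketch below splits a single fine-tuned weight difference into a truncated-SVD low-rank part plus a sparse residual that keeps only the largest-magnitude entries; the rank and sparsity level are arbitrary placeholder values.

```python
import numpy as np

def low_rank_plus_sparse(delta: np.ndarray, rank: int = 8, keep_ratio: float = 0.05):
    """Split a weight delta into a rank-`rank` approximation and a sparse residual."""
    # Low-rank component via truncated SVD: keeps the dominant structure.
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

    # Sparse component: keep only the largest-magnitude residual entries.
    residual = delta - low_rank
    k = max(1, int(keep_ratio * residual.size))
    threshold = np.partition(np.abs(residual).ravel(), -k)[-k]
    sparse = np.where(np.abs(residual) >= threshold, residual, 0.0)

    return low_rank, sparse

# Toy example: a random 256x256 "fine-tuned minus pretrained" weight difference.
rng = np.random.default_rng(0)
delta = rng.standard_normal((256, 256))
low_rank, sparse = low_rank_plus_sparse(delta)
print(low_rank.shape, np.count_nonzero(sparse))
```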
The process begins with selecting a pretrained speech model, such as Whisper, as a foundation. The model is then fine-tuned separately for each language and task. After fine-tuning, redundant or conflicting parameters are removed to ensure smooth integration and prevent interference between languages. Finally, the refined models are merged into a single system, maintaining the strengths of each while improving efficiency.
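The following sketch shows, under the same assumptions, how those steps could fit together: plain NumPy arrays stand in for per-language fine-tuned weights, the pruning from the previous snippet is repeated in compact form, and the pruned task vectors are combined with an equal-weight average. The helper and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def prune_delta(delta, rank=8, keep_ratio=0.05):
    # Truncated-SVD low-rank part plus a sparse residual of the
    # largest-magnitude entries (same idea as the previous sketch).
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    residual = delta - low_rank
    k = max(1, int(keep_ratio * residual.size))
    thr = np.partition(np.abs(residual).ravel(), -k)[-k]
    return low_rank + np.where(np.abs(residual) >= thr, residual, 0.0)

def merge_models(pretrained, finetuned, rank=8, keep_ratio=0.05):
    # For each weight matrix, prune every language's task vector
    # (fine-tuned weights minus the shared base) and average the results.
    merged = {}
    for name, base in pretrained.items():
        deltas = [prune_delta(m[name] - base, rank, keep_ratio) for m in finetuned]
        merged[name] = base + np.mean(deltas, axis=0)
    return merged

# Toy usage: two "languages" fine-tuned from the same random base weight.
rng = np.random.default_rng(0)
base = {"proj": rng.standard_normal((64, 64))}
lang_a = {"proj": base["proj"] + 0.1 * rng.standard_normal((64, 64))}
lang_b = {"proj": base["proj"] + 0.1 * rng.standard_normal((64, 64))}
merged = merge_models(base, [lang_a, lang_b])
print(merged["proj"].shape)
```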
Scalable and Effective Complement
The researchers tested LoRS-Merging on the CoVoST-2 dataset, covering ten languages — including high-resource languages such as Catalan, German, Spanish, French, and Italian, as well as low-resource languages such as Indonesian, Dutch, Portuguese, Russian, and Swedish.
Results showed a 10% reduction in word error rate (WER) for ASR and a 4% increase in BLEU score for ST compared to conventional multilingual training.
“Our findings suggest that model merging, particularly LoRS-Merging, is a scalable and effective complement to traditional multi-lingual training strategies for speech-to-text applications,” the researchers noted.
While the results are promising, the researchers acknowledge that challenges remain, particularly in adapting the method for models with different architectures. Future work will focus on refining LoRS-Merging to support even greater model diversity and to explore applications in spoken language understanding and speaker adaptation, they concluded.
Authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, and Thomas Fang Zheng