In a July 11, 2024 paper, Alibaba Group’s Tongyi SpeechTeam presented FunAudioLLM, a large language model (LLM) family that integrates voice understanding and generation technologies to enable natural, speech-driven interactions.
The researchers explained that recent advances in artificial intelligence (AI) have transformed how humans interact with machines. Their focus is “to enhance natural voice interactions between humans and LLMs” by developing models that understand and generate speech, not just text, which would allow more natural, hands-free interaction with AI systems.
The FunAudioLLM framework is built on two core models: SenseVoice, a voice understanding model for multilingual speech recognition and emotion detection, and CosyVoice, a text-to-speech synthesizer for speech generation. “FunAudioLLM leverages the strengths of SenseVoice and CosyVoice to push the boundaries of voice interaction technology, enabling more natural and seamless communication between humans and large language models,” said the researchers.
FunAudioLLM is designed to improve a variety of voice interaction applications, including:
- Speech-to-Speech Translation — facilitating real-time interpreting while preserving the speaker’s voice characteristics.
- Emotional Voice Chat — enabling more natural and emotionally aware conversational agents.
- Interactive Podcasts — allowing dynamic and engaging live discussions with AI models.
- Expressive Audiobook Narration — providing rich, multi-character narration for audiobooks.
Authentic and Engaging Conversations
“By combining SenseVoice, LLMs, and CosyVoice, we can effortlessly perform speech-to-speech translation,” said the researchers.
The pipeline has three stages: SenseVoice recognizes the input speech in its original language, the LLM translates the recognized text into the target language, and CosyVoice synthesizes the translated text into speech. The resulting audio retains the user’s voice characteristics through cross-lingual voice cloning. “This allows users to speak in foreign languages using their own voice,” they noted.
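The three-stage flow described above can be sketched in a few lines of Python. This is only an illustration of how the stages hand data to one another, not the researchers' implementation: every name here (`Transcript`, `SynthesisRequest`, `speech_to_speech`, and the three placeholder stage functions) is hypothetical and does not correspond to the actual SenseVoice or CosyVoice APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    """Output of the recognition stage (hypothetical structure)."""
    text: str
    language: str
    emotion: str


@dataclass
class SynthesisRequest:
    """Input to the synthesis stage (hypothetical structure)."""
    text: str
    language: str
    emotion: str
    reference_audio: bytes  # original speech, reused as the voice-cloning prompt


def speech_to_speech(
    audio: bytes,
    target_language: str,
    recognize: Callable[[bytes], Transcript],
    translate: Callable[[str, str, str], str],
    synthesize: Callable[[SynthesisRequest], bytes],
) -> bytes:
    """Chain recognition, translation, and synthesis into one call."""
    # Stage 1: recognize the speech in its original language.
    transcript = recognize(audio)
    # Stage 2: translate the recognized text into the target language.
    translated = translate(transcript.text, transcript.language, target_language)
    # Stage 3: synthesize the translation, passing along the detected emotion
    # and the original audio so the output can keep the speaker's voice.
    request = SynthesisRequest(translated, target_language, transcript.emotion, audio)
    return synthesize(request)


# Placeholder stages standing in for real models (assumptions for illustration).
def fake_recognize(audio: bytes) -> Transcript:
    return Transcript(text="hello", language="en", emotion="happy")


def fake_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


def fake_synthesize(request: SynthesisRequest) -> bytes:
    return f"{request.language}|{request.emotion}|{request.text}".encode("utf-8")
```

Passing the stages in as callables keeps the orchestration independent of any particular model, so a real recognizer, LLM, or synthesizer could be swapped in without changing `speech_to_speech`.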
In a post on X, the researchers highlighted that this method not only improves translation efficiency and fluency but also captures the emotions and tones in the original speech, reproducing these emotional nuances in the translated speech.
“This makes conversations more authentic and engaging,” they said, and “significantly reduces language barriers and communication losses” in contexts such as multilingual conference interpreting, cross-cultural communication, or providing instant voice translation services for non-native speakers.
FunAudioLLM supports a wide range of languages, enhancing its utility in global applications. Demos and the code are available on GitHub.
Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, Zhangyu Xiao, Zhijie Yan, Yexin Yang, Bin Zhang, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Siqi Zheng