On July 31, 2024, ByteDance’s Cross Language Agent Team presented a system designed to deliver “high-quality” and “human-like” simultaneous speech translation (SiST).
The researchers underscored the complexity of SiST, describing it as “one of the most challenging tasks in the translation domain.” Despite notable advancements in academic and commercial SiST models, they acknowledged that “the translation quality is still far from satisfactory,” highlighting the need for a more effective solution.
Inspired by the success of large language models (LLMs) in machine translation (MT) and speech translation, the ByteDance team leveraged LLMs to tackle the SiST challenges. Their solution is a Cross-Lingual Agent that performs Simultaneous Interpretation (“CLASI”) through a systematic execution of various operations.
CLASI operates through a structured five-step process, starting with the processing of incoming audio data. To mimic professional human interpreters, who often break down sentences into smaller “semantic chunks” based on natural pauses, punctuation marks, and meaning, CLASI employs a “data-driven policy learning” method.
By training on human-annotated speech data, CLASI learns how to recognize natural breaks in speech, developing a robust “read-write policy” that guides it on when to listen (read) and when to translate (write) during the speech.
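As a rough illustration of what a read-write policy decides, the sketch below closes a chunk (write) at punctuation or after a long pause, and otherwise keeps listening (read). It is a hand-written heuristic standing in for CLASI’s learned policy, not the researchers’ method; the `Token` type, the `pause_after` feature, and the 0.4-second threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pause_after: float  # hypothetical feature: seconds of silence after the token


def segment_chunks(tokens, pause_threshold=0.4):
    """Group tokens into chunks, closing a chunk at punctuation or a long pause."""
    chunks, current = [], []
    for tok in tokens:
        current.append(tok.text)  # READ: keep listening
        ends_clause = tok.text.endswith((",", ".", ";", "?", "!"))
        if ends_clause or tok.pause_after >= pause_threshold:
            chunks.append(" ".join(current))  # WRITE: hand the chunk to translation
            current = []
    if current:  # flush whatever is left when the speech ends
        chunks.append(" ".join(current))
    return chunks
```

A learned policy would replace the two hard-coded conditions with a model prediction, but the read/write alternation has the same shape.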
In the second step, CLASI employs a multi-modal retriever to access relevant information from an external knowledge base.
The third step involves retrieving context from the memory of previous rounds, which stores data from earlier translations. By appending the information retrieved from the external knowledge base and the context from the translation memory to the LLM agent’s prompt, CLASI dynamically integrates relevant knowledge, significantly improving the accuracy and coherence of its translations, according to the researchers.
In the fourth step, after processing the input and retrieving relevant information, CLASI generates the transcription (if needed), the translation output, and a timestamp that indicates when the current translation round ends. This timestamp allows the system to determine where to begin in the next round of audio input. In the fifth step, it updates its memory with the new translations, retaining context for future processing. The cycle then restarts from step one for the next speech segment.
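The round-by-round cycle described above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not ByteDance’s implementation: the `InterpreterLoop` and `RoundOutput` names are hypothetical, the audio chunk is represented by a plain string, the multi-modal retriever is replaced by simple substring matching, and the LLM is any callable supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RoundOutput:
    transcription: str
    translation: str
    end_timestamp: float  # where the next round should start reading


@dataclass
class InterpreterLoop:
    llm: Callable[[str], RoundOutput]  # hypothetical: prompt in, round output out
    knowledge_base: dict[str, str]     # hypothetical: term -> preferred translation
    memory: list[RoundOutput] = field(default_factory=list)
    memory_rounds: int = 2             # how many past rounds to keep in the prompt

    def run_round(self, segment: str) -> RoundOutput:
        # Step 1: `segment` stands in for the incoming audio chunk.
        # Step 2: retrieve matching entries (substring match replaces the retriever).
        terms = {k: v for k, v in self.knowledge_base.items() if k in segment}
        # Step 3: load context from the most recent rounds.
        context = [r.translation for r in self.memory[-self.memory_rounds:]]
        # Step 4: append both to the prompt and generate the round's output.
        prompt = f"terms={terms}\ncontext={context}\ninput={segment}"
        output = self.llm(prompt)
        # Step 5: update memory so the next round sees this translation.
        self.memory.append(output)
        return output
```

Calling `run_round` once per chunk reproduces the cycle: each call sees the terms matched for that chunk and the translations from the preceding rounds.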
“Supported by LLMs, our approach can generate error-tolerated translation by considering the input audio, historical context, and retrieved information,” the researchers said.
Performance “Close to Human Interpreters”
To assess CLASI’s performance, the team developed a new evaluation metric called “VIP” (Valid Information Proportion), which measures the amount of information that can be successfully conveyed to listeners during simultaneous speech translation/interpretation.
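One simple way to read a proportion-style score such as VIP is as the share of judged segments whose translation got the information across. The sketch below assumes exactly that: a list of per-segment yes/no judgments from a human evaluator. The `vip_score` function and its input format are illustrative assumptions, not the researchers’ evaluation code.

```python
def vip_score(judgments):
    """Return the share of segments judged to convey the intended information.

    `judgments` is a list of booleans, one per segment (assumed input format).
    """
    if not judgments:
        raise ValueError("need at least one judged segment")
    return sum(1 for ok in judgments if ok) / len(judgments)
```

Under this reading, three valid segments out of four would give a score of 0.75, or 75%.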
According to the researchers, VIP better reflects the performance of SiST systems in real-world scenarios. They tested CLASI against other top simultaneous interpretation systems, both commercial and open-source, and found that CLASI outperformed them “by significant margins.”
CLASI achieved a VIP score of 81.3% for Chinese-to-English and 78.0% for English-to-Chinese translations. In contrast, state-of-the-art commercial or open-source systems only achieved VIP scores of 35.4% and 41.6%, respectively. Even on extremely challenging datasets, where other systems scored under 13% VIP, CLASI maintained a VIP of 70%, said the researchers.
The researchers ventured as far as stating that “these results are close to the performance of human interpreters, who typically achieve around 80% VIP.”
Improving Interpreters’ Efficiency
The researchers believe the system can be applied in various scenarios to facilitate cross-lingual communication, such as international conferences and daily meetings, enabling attendees to understand speeches in different languages.
CLASI can also function as a system-level translation module, enhancing the viewing experience for users watching videos in foreign languages by providing real-time translations, added the researchers.
In the online gaming sector, CLASI could aid communication among players who speak different languages, fostering a more inclusive gaming environment. Additionally, with its “human parity performance,” it could improve the efficiency of professional human interpreters, the researchers claim.
“With the powerful translation ability of CLASI, we believe it can further make cross-lingual communication seamless across different places all over the world,” the researchers concluded.
Looking ahead, the ByteDance team plans to expand CLASI to support additional languages, including low-resource ones.
Demonstrations and human-annotated test sets are available on GitHub.
Authors: Shanbo Cheng, Zhichao Huang, Tom Ko, Hang Li, Ningxin Peng, Lu Xu, Qini Zhang