On February 5, 2025, French AI research lab Kyutai introduced Hibiki, “a model for simultaneous, on-device, high fidelity speech-to-speech translation.” Hibiki is Japanese for “resonance” or “echo” and is also the name of a well-known whisky brand.

Kyutai researchers explained that, unlike offline speech translation systems, which must wait for the full source utterance before starting to translate, Hibiki processes source and target speech simultaneously. This allows the model to adapt dynamically, gathering just enough context to produce accurate translations in real time.

A standout feature of Hibiki is “contextual alignment,” a method that identifies the optimal delay for translation at the word level by leveraging an external machine translation model. This approach allows Hibiki to insert natural pauses in speech, ensuring smoother and more natural translations.

“By […] introducing proper silences into target speech, we can train Hibiki to adapt its flow in real-time, without the need for complex inference policies,” the researchers noted.
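To make the idea concrete, here is a minimal sketch of one way a word-level delay could be derived from an external machine translation model. It is an illustration under stated assumptions, not Kyutai's implementation: the function name, the score table, and the rule of picking the source word with the largest likelihood gain are all hypothetical, and the scores are made-up numbers rather than output from a real model.

```python
from typing import List


def contextual_alignment(logprobs: List[List[float]]) -> List[int]:
    """Assign each target word the source word it appears to depend on.

    logprobs[j][i] is an assumed log-likelihood of target word j when an
    external MT model has seen the first i source words (i = 0 means no
    source context). For each target word we pick the source word whose
    arrival yields the largest likelihood gain.
    """
    alignment: List[int] = []
    latest = 0  # keep the alignment monotonic (an assumption of this sketch)
    for row in logprobs:
        # gains[i] is the gain from revealing source word i (0-based)
        gains = [row[i + 1] - row[i] for i in range(len(row) - 1)]
        best = max(range(len(gains)), key=gains.__getitem__)
        latest = max(latest, best)
        alignment.append(latest)
    return alignment


# Toy example: source "le chat noir", target "the black cat".
# Each row has 4 made-up scores: after 0, 1, 2 and 3 source words.
toy_scores = [
    [-5.0, -0.5, -0.4, -0.4],  # "the" becomes likely once "le" is heard
    [-6.0, -5.8, -5.5, -0.3],  # "black" needs "noir", the last word
    [-6.0, -5.9, -0.6, -0.5],  # "cat" needs "chat", but follows "black"
]
print(contextual_alignment(toy_scores))
```

With these toy scores the result is `[0, 2, 2]`: “the” can be spoken right after the first source word, while “black” and “cat” both have to wait for the third. In a training setup along these lines, that waiting time is where silence would be inserted into the target speech.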

Additionally, Hibiki uses a multistream architecture to generate both spoken audio and written text at the same time. Operating at a fixed rate of 12.5 Hz, or one frame every 80 milliseconds, it produces smooth, continuous speech that stays in sync with timestamped text.
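The relationship between the frame rate and the frame duration is simple arithmetic. The short sketch below works it through; only the 12.5 Hz figure comes from Kyutai, and the helper names are hypothetical, chosen for illustration.

```python
import math

FRAME_RATE_HZ = 12.5  # frame rate reported for Hibiki


def frame_duration_ms(rate_hz: float = FRAME_RATE_HZ) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 / rate_hz


def frames_for(seconds: float, rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover a stretch of audio."""
    return math.ceil(seconds * rate_hz)


def frame_start_ms(index: int, rate_hz: float = FRAME_RATE_HZ) -> float:
    """Timestamp at which frame `index` (0-based) begins."""
    return index * frame_duration_ms(rate_hz)


print(frame_duration_ms())  # 80.0
print(frames_for(10.0))     # 125
print(frame_start_ms(3))    # 240.0
```

At this rate, ten seconds of speech corresponds to 125 frames, and any text aligned to a given frame inherits a timestamp that is a multiple of 80 milliseconds.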

“As the user speaks, Hibiki generates natural speech in the target language, with voice transfer, along with a text translation,” they said.

Kyutai claims that “Hibiki is the first model to provide an experience of interpretation close to human professionals,” while outperforming existing models in translation quality, speaker fidelity, and naturalness.

According to Kyutai, human evaluations confirmed Hibiki’s superior performance, with the company stating on X: “Based on objective and human evaluations, Hibiki outperforms previous systems for quality, naturalness and speaker similarity and approaches human interpreters.”

Large-Scale Deployment

Hibiki’s main backbone consists of 2 billion parameters and can process multiple translation tasks at once, making it highly efficient for “large-scale deployment.”

For on-device applications, Kyutai has also introduced Hibiki-M, a lighter 1-billion-parameter version capable of running real-time translations on smartphones.

Kyutai’s co-founder and CTO, Laurent Mazaré, noted in a post on X that Hibiki is “robust to extreme background conditions” and can even function without full network access.

Currently, Hibiki supports only French-to-English translation, but Kyutai plans to extend it to many more languages, with the aim “to deliver a definitive solution for live speech translation.”

As part of its open-science initiative, Kyutai has released the Hibiki models, inference code and weights, and a 900-hour synthetic dataset. The company also invites users to explore sample outputs showcasing Hibiki’s potential applications.

Authors: Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez, and Neil Zeghidour


