Tracing Meta’s Path to Text and Speech Translation Model SeamlessM4T – slator.com


Social network giant Meta made a splash in August 2023 with SeamlessM4T, a model that offers different combinations of text and speech translation for at least dozens of languages.

What makes the multimodal model unique is its ability to handle both text and speech in a single system, as opposed to siloing these capabilities in separate models.

In their 100-page paper on the model, the more than 60 authors encouraged comparisons to the science fiction touchpoint often invoked in hype for multilingual technological advancements. Observers obliged.

“How far are we from the BabelFish?” one asked on X. “Put this sucker in a phone and we are pretty much there.”

But, as any language industry veteran will attest, silver bullets seem to lose their luster upon closer inspection.

Responding to a LinkedIn post by Meta VP and Chief AI Scientist Yann LeCun, commenters asked whether SeamlessM4T offered speaker recognition, probed the model’s ability to handle source speech containing more than one language, and pointed out specific languages currently unavailable for certain speech/text translation combinations.

Even fans shared their reservations, or more measured takes. Translator William Dan praised SeamlessM4T for its speed and its ability to run on a user’s local GPU — naturally a better option for data protection than a model accessed online.

Still, Dan admitted, the text it produces is undeniably machine translation, with the accompanying grammatical mistakes and even issues with missing translations.

“But to be honest, if companies don’t cut translator rates with SeamlessM4T, I’ll post-edit the outputs without a single complaint,” Dan added.

Predicting Impact

Of course, with a model that provides more than “just” text-to-text translation, translators are not the only language professionals interested in how SeamlessM4T might fit into their work.

Notably, Claudio Fantinuoli, CTO of interpreting tech company Kudo, was quoted as saying that Meta’s tool, and others that similarly combine functions in one system, are the future. The quote appeared in an article about AI’s influence on simultaneous interpreting in the English-language edition of El País, a leading newspaper in Spain.

According to the article, Kudo already has 20 clients who use the company’s tool for automated interpreting; Kudo product manager Tzachi Levy describes this solution as “effective” for smaller meetings where human interpreters are not present.

Just as Kudo continues to improve its own offering — which could be likened to a real-time dub — Meta’s SeamlessM4T seems to be another big advance in speech translation research.

Setting the Stage

Beyond bragging rights, Meta’s interest in real-time speech translation is likely tied to the company’s goal of bringing its “metaverse” to the widest audience possible, with frictionless communication across languages as a selling point.

This was also the case for Meta’s No Language Left Behind (NLLB) project, touted in July 2022 as helping speakers of thousands of languages access “new, immersive experiences in virtual worlds.” While NLLB’s focus was low-resource languages, its scale (200 languages and 40,000 possible translation directions) eclipses that of SeamlessM4T, at least for now.

Meta’s Massively Multilingual Speech project, which debuted in May 2023, brought the company one step closer to SeamlessM4T by offering the usually disparate speech-to-text and text-to-speech capabilities in a single system. Again, Meta went big, covering 1,100 languages, albeit with mixed results.

Voicebox, introduced in June 2023, was billed as the first model to generalize to speech-generation tasks (i.e., to be able to handle speech generation without specifically being trained for that kind of task). Trained on over 50,000 hours of recorded speech and transcripts in six languages, Voicebox can take input text, or text plus audio, in one of those six languages and generate audio output in another.

Unlike with its other recent advancements, Meta decided not to open-source the Voicebox model or code, in order to prevent misuse. SeamlessM4T, by contrast, is available on GitHub along with its code and metadata.

SeamlessM4T has already inspired a September 2023 hackathon, to the tune of 265 participants and eight AI applications (finalists TBD). The questions now are in which direction Meta will take this latest development, and how soon the public will find out.




