In a July 18, 2024 paper, Matthieu Futeral, Cordelia Schmid, Benoît Sagot, and Rachel Bawden from Inria, the French National Institute for Research in Digital Science and Technology, introduced ZeroMMT, a new method for multimodal machine translation (MMT) that eliminates the reliance on fully supervised data.

The researchers explained that MMT integrates additional modalities like images or videos to enhance text-based translations, especially when dealing with ambiguous text. In MMT “the main purpose is to provide an additional signal in the case of ambiguity in the text to be translated,” they noted.

Current MMT systems depend heavily on datasets that include images with multilingual captions, like Multi30K. However, creating such datasets is expensive and limits the expansion of MMT to new languages, which the researchers characterized as a “fundamental limitation.”

ZeroMMT addresses this limitation by leveraging only multimodal English data, thereby bypassing the need for supervised data. The proposed method adapts an existing text-only translation model to use visual information from images to improve translation and is “the first zero-shot method to tackle this problem,” Matthieu Futeral told Slator.

“Our goal is to train an MMT model capable of effectively using images to disambiguate contrasting translations while maintaining its translation capabilities, without using any fully supervised data,” said the researchers.

Specifically, ZeroMMT trains an MT system to draw on visual information from images to better understand and translate sentences, particularly in cases where the text alone is ambiguous. To achieve this, the SigLIP vision-language model is used to encode images into representations the translation system can process, and these image representations are then integrated with the text input. To ensure that translations remain accurate and of high quality, the outputs of ZeroMMT are kept close to those of the original text-only MT model.
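The last step, keeping the adapted model's outputs close to the original model's, is commonly implemented as a divergence penalty between the two models' next-token probability distributions. The sketch below illustrates that idea with a KL-divergence term over a hypothetical toy vocabulary; the distributions and vocabulary size are illustrative assumptions, not values from the paper, and the real system operates on a full decoder vocabulary:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two token probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a tiny 4-token vocabulary.
original_mt = [0.70, 0.20, 0.05, 0.05]  # frozen text-only model (e.g. NLLB)
adapted_mmt = [0.60, 0.30, 0.05, 0.05]  # image-conditioned model being trained

penalty = kl_divergence(adapted_mmt, original_mt)

# Adding this penalty to the training loss discourages the adapted model
# from drifting away from the original except where the image warrants it.
print(f"KL penalty: {penalty:.4f}")
```

In training, a small penalty like this one lets the model shift probability mass toward an image-supported translation while a large penalty would signal a harmful departure from the original model's behavior.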

A Step Towards MMT

The researchers tested ZeroMMT on standard benchmarks and CoMMuTE, a contrastive benchmark for image-based disambiguation in English sentences, across six language directions (English to French, Czech, German, Arabic, Russian, and Chinese). They compared ZeroMMT against the text-only MT system NLLB and well-known fully supervised MMT systems. 

They found that ZeroMMT can leverage images to adjust translations towards the correct meaning, achieving disambiguation performance close to that of state-of-the-art MMT models trained on fully supervised data, with only a very small drop in performance where images are unnecessary for accurate translation.

The researchers emphasized that “these results show that our approach is able to maintain good translation performance whilst still being able to exploit visual information for disambiguation.”

They concluded that ZeroMMT is “a step towards having MMT systems that cover a broader set of languages without having to rely on acquiring costly training data.”

The code, data, and trained models are publicly accessible on GitHub.