Companies have been crowing about machine translation (MT) achieving human parity practically since neural MT burst onto the scene in 2016. True, MT quality continues to improve — albeit unevenly across languages and domains — but there are still many instances in which only a human translation (HT), or an expert-in-the-loop translation, will do.
Distinguishing HT from MT is only becoming more difficult, especially as MT is now used to create some source texts. That is where researchers at the University of Groningen saw an opportunity to develop a so-called “classifier” — a monolingual or multilingual language model fine-tuned with small amounts of task-specific labeled data.
In a May 2023 paper, Automatic Discrimination of Human and Neural Machine Translation in Multilingual Settings, authors Malina Chichirau, Rik van Noord, and Antonio Toral considered classifiers in multilingual settings.
They found that using training data from multiple source languages improved the accuracy of both monolingual and multilingual classifiers.
The researchers culled non-English source texts and their corresponding English HT and MT from WMT news shared tasks to create a data set. Monolingual classifiers were trained on English-only data, while multilingual classifiers were trained on both source texts and their English translations.
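The difference between the two setups comes down to what text the classifier sees. A minimal sketch of the input construction — field names and the separator token are assumptions for illustration, not taken from the paper:

```python
# Illustrative sketch (not the authors' code): assembling classifier inputs
# for the monolingual vs. multilingual settings described above.

def build_input(example, multilingual=False, sep="</s>"):
    """Return the text a classifier would see for one WMT example.

    example: dict with 'source' (non-English source sentence) and
    'translation' (the English HT or MT output to be classified).
    The 'sep' separator token is an assumption.
    """
    if multilingual:
        # Multilingual setting: the model sees the source sentence
        # alongside its English translation.
        return f"{example['source']} {sep} {example['translation']}"
    # Monolingual setting: the English translation only.
    return example["translation"]

example = {
    "source": "Das ist ein Test.",     # German source sentence
    "translation": "This is a test.",  # English translation (HT or MT)
}

mono = build_input(example)
multi = build_input(example, multilingual=True)
```

Here `mono` is just the English sentence, while `multi` pairs the German source with its translation — which is why only the multilingual classifier can exploit signals in the source text.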
Multilingual classifiers were more accurate than monolingual ones at identifying translations as human or machine, indicating that classifiers clearly benefited from access to source sentences.
And experiments with German, Russian, and Chinese showed that training on multiple source languages improved classifier performance in other languages.
A Promising Direction
“There does seem to be a diminishing effect of incorporating training data from different source languages, though, as the best score is only once obtained by combining all three languages as training data,” the authors wrote. “Nevertheless, given the improved performance for even only small amounts of additional training data (Chinese has only 1,756 training instances), we see this as a promising direction for future work.”
The group also found that fine-tuning a sentence-level model on document-length text improved results, and worked better than simply training models on documents rather than on sentences from the start. Fine-tuning in this way led to the highest levels of accuracy and the lowest standard deviations, indicating more stable classifiers.
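The document-level stage requires turning per-sentence training data back into longer texts. One way that step might look — the chunk size is an assumed parameter for illustration, not a figure from the paper:

```python
# Illustrative sketch (assumptions, not the authors' code): grouping
# consecutive sentences from one translated document into document-length
# training texts for the second fine-tuning stage.

def sentences_to_documents(sentences, max_sents=10):
    """Join runs of consecutive sentences from a single document into
    longer training texts. max_sents caps the chunk length so the result
    still fits a typical encoder's input window (assumed value)."""
    docs = []
    for i in range(0, len(sentences), max_sents):
        docs.append(" ".join(sentences[i:i + max_sents]))
    return docs

# A toy 23-sentence "document" yields chunks of 10, 10, and 3 sentences.
sents = [f"Sentence {n}." for n in range(1, 24)]
docs = sentences_to_documents(sents)
```

A sentence-level classifier would then be fine-tuned further on these longer chunks rather than being trained on documents from scratch.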
Looking ahead, as text generation continues to incorporate MT, the researchers wrote, it will likely become more difficult to distinguish original texts from translations. The next logical step in this line of research, then, will address classifiers that can identify text as original, HT, or MT.