Researchers Shannon Wotherspoon, William Hartmann, and Matthew Snover of Raytheon BBN published a paper in March 2024 introducing a corpus of Mandarin Chinese speech paired with English text translations, intended as training data for speech machine translation (MT). Raytheon BBN, based in Cambridge, Massachusetts, is a research and technology company within the Raytheon (RTX) group, a major defense contractor.

The researchers explained that the purpose of this paired dataset, which combines source-language speech with target-language text, was to provide a Mandarin-English corpus for training end-to-end speech translation systems and for improving cascaded systems.

The researchers argue that the resulting corpus is “addressing a critical gap in resources and underscoring the importance of domain-specific data in advancing the state-of-the-art in speech translation.”

The Raytheon BBN researchers sourced the data from 123.5 hours of Mandarin telephone conversations. The speech came from two public datasets: the CallHome Mandarin Chinese Speech and the HKUST Mandarin Telephone Speech datasets.

The CallHome dataset contained 242 unscripted telephone conversations between native Mandarin speakers, whereas the HKUST dataset contained 90 hours of speech from 1,124 conversations between Mandarin speakers (not necessarily all native speakers) in Mainland China.

The data were split into train, development, and test sets, with the train set drawing on both Mandarin datasets. For the two development sets and the test set, the researchers used only CallHome conversations.

The English translations were produced from transcripts by Mandarin-English bilingual annotators at Appen. The annotators did not have access to the audio for the conversations and relied on the surrounding transcript text for context. In the final text corpus, identical utterances were translated only once, regardless of how often they occurred, and the annotators were instructed “to preserve any disfluencies, hesitations, or code-switching present in the data.”
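The translate-once rule for identical utterances can be illustrated with a minimal Python sketch. This is not the researchers' or Appen's actual tooling: the function name `translate_once` and the `translate` callback are hypothetical, and the sketch assumes utterances are compared by exact transcript text.

```python
from typing import Callable, Dict, List


def translate_once(utterances: List[str],
                   translate: Callable[[str], str]) -> List[str]:
    """Translate each distinct utterance a single time and reuse the result.

    `translate` is a hypothetical stand-in for the human annotation step.
    Utterances are treated as identical only if their transcripts match exactly.
    """
    cache: Dict[str, str] = {}
    for text in utterances:
        if text not in cache:
            # First time this transcript is seen: request a translation.
            cache[text] = translate(text)
    # Map every utterance, including repeats, to its cached translation.
    return [cache[text] for text in utterances]
```

Under this assumption, a frequent backchannel that appears hundreds of times in the transcripts would reach the annotators only once, while every occurrence still receives a translation in the final corpus.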

For their experiments, the researchers used output from an automatic speech recognition (ASR) model built on Raytheon BBN’s own speech processing platform, called “Sage,” which the company introduced in 2016. The model was trained on the Mandarin conversational telephone speech train dataset, along with an additional 137 hours of Mandarin ASR-only data from the HKUST set.

The researchers reported that general-purpose models may suffice for MT in some domains but fall short in others, and that such models performed poorly on Mandarin conversational speech.

After the researchers fine-tuned the model on the conversational, domain-specific train set they had created, MT scores measured with the BLEU metric improved by 137% compared with results from general-purpose models alone.
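For readers who want to see what a relative improvement of that size means, here is a minimal Python sketch of the arithmetic. The function name is hypothetical and the scores in the usage line are invented for illustration; they are not figures from the paper.

```python
def relative_improvement(baseline: float, tuned: float) -> float:
    """Return the percentage gain of `tuned` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (tuned - baseline) / baseline * 100.0


# Hypothetical BLEU scores, chosen only to illustrate the calculation.
gain = relative_improvement(baseline=10.0, tuned=23.7)
print(f"Relative improvement: {gain:.0f}%")
```

A relative gain of 137% means the fine-tuned score is 2.37 times the baseline score, which is a different statement from a gain of 137 BLEU points.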
