Advancing Speech Translation: A Corpus of Mandarin-English Conversational Telephone Speech

Shannon Wotherspoon, William Hartmann, Matthew Snover

TL;DR

The paper addresses the lack of matched Mandarin-English conversational speech translation data by introducing a 123.5-hour conversational telephone speech (CTS) corpus derived from the CallHome and HKUST sources, with English translations produced by Appen. It evaluates cascaded speech translation using a TDNN-LSTM ASR system and the NLLB MT model, showing that fine-tuning a general-purpose MT model on the target CTS data yields substantial gains: BLEU rises from 5.98 to 14.16, a 137% relative improvement. The study also reports an ASR WER of 26.7 on the CTS test set and discusses the importance of domain-aligned data for translation quality. Overall, the work demonstrates that matched training data is essential for high-performance Mandarin-English conversational speech translation, and it provides a resource for further research and development in this domain.
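As a quick check on the reported figures, the short sketch below recomputes the absolute and relative BLEU gains from the two scores quoted above (5.98 before fine-tuning, 14.16 after). The function name `bleu_gain` is a hypothetical helper introduced only for this illustration; it is not taken from the paper.

```python
def bleu_gain(baseline: float, fine_tuned: float) -> tuple[float, float]:
    """Return (absolute gain in BLEU points, relative gain in percent)."""
    absolute = fine_tuned - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Scores quoted in the summary: general-purpose MT model vs. fine-tuned model.
absolute, relative = bleu_gain(5.98, 14.16)
print(f"absolute gain: {absolute:.2f} BLEU, relative gain: {relative:.0f}%")
# prints "absolute gain: 8.18 BLEU, relative gain: 137%"
```

The absolute gain of 8.18 points matches the abstract's "more than 8 points", and the relative gain rounds to the 137% stated here.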

Abstract

This paper introduces a set of English translations for a 123-hour subset of the CallHome Mandarin Chinese data and the HKUST Mandarin Telephone Speech data for the task of speech translation. Paired source-language speech and target-language text is essential for training end-to-end speech translation systems and can provide substantial performance improvements for cascaded systems as well, relative to training on more widely available text data sets. We demonstrate that fine-tuning a general-purpose translation model to our Mandarin-English conversational telephone speech training set improves target-domain BLEU by more than 8 points, highlighting the importance of matched training data.

Paper Structure

This paper contains 5 sections and 2 tables.