Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition
Niko Moritz, Ruiming Xie, Yashesh Gaur, Ke Li, Simone Merello, Zeeshan Ahmed, Frank Seide, Christian Fuegen
TL;DR
The paper introduces JSTAR, a streaming end-to-end model for joint automatic speech recognition (ASR) and speech translation (ST), built on a fast-slow RNN-T encoder with separate ASR and ST predictors. It shows that a transducer-based machine translation (MT) model can initialize JSTAR, and that multi-talker training via serialized output training (SOT) supports bilingual conversations with overlapping speech on a smart-glasses platform, achieving competitive BLEU scores and lower latency than cascaded baselines. Experiments on MC-FLEURS and RealConv show that JSTAR benefits from both supervised and unsupervised data, reducing latency and improving translation quality. The work also shows that streaming MT with a transducer can approach transformer-level performance while enabling online translation, and that pretraining JSTAR components yields additional gains.
Abstract
We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is transducer-based and uses a multi-objective training strategy that optimizes both ASR and ST objectives simultaneously. This allows JSTAR to produce high-quality streaming ASR and ST results. We apply JSTAR in a bilingual conversational speech setting with smart glasses, where the model is also trained to distinguish speech from different directions corresponding to the wearer and a conversational partner. Different model pre-training strategies are studied to further improve results, including training of a transducer-based streaming machine translation (MT) model for the first time and applying it for parameter initialization of JSTAR. We demonstrate superior performance of JSTAR compared to a strong cascaded ST model in both BLEU scores and latency.
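The multi-objective training strategy mentioned in the abstract can be illustrated with a minimal sketch. It assumes the two objectives are combined as a weighted sum of per-batch ASR and ST transducer losses; the abstract does not state how they are combined, so the function name joint_loss, the class JointLossWeights, and the 0.5/0.5 weights are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JointLossWeights:
    # Hypothetical interpolation weights; the source does not give values.
    asr: float = 0.5
    st: float = 0.5


def joint_loss(asr_loss: float, st_loss: float,
               weights: JointLossWeights = JointLossWeights()) -> float:
    """Combine the ASR and ST losses of one batch into a single objective.

    Assumption: a plain weighted sum, so one backward pass updates the
    parameters shared by both tasks.
    """
    if weights.asr < 0 or weights.st < 0:
        raise ValueError("loss weights must be non-negative")
    return weights.asr * asr_loss + weights.st * st_loss


# Example: equal weighting of an ASR loss of 2.0 and an ST loss of 4.0.
combined = joint_loss(2.0, 4.0)
```

In a real training loop the two inputs would be the transducer losses computed from the ASR and ST branches on the same batch; scalars are used here only to keep the sketch self-contained.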
