Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition

Niko Moritz; Ruiming Xie; Yashesh Gaur; Ke Li; Simone Merello; Zeeshan Ahmed; Frank Seide; Christian Fuegen

TL;DR

The paper introduces JSTAR, a streaming end-to-end model for joint automatic speech recognition (ASR) and speech translation (ST), built on a fast-slow cascaded encoder in an RNN-T (transducer) architecture with separate ASR and ST predictors. It shows that a transducer-based machine translation (MT) model can be used to initialize JSTAR, and that multi-talker training with serialized output training (SOT) supports bilingual conversations with overlapping speech on a smart-glasses platform, achieving competitive BLEU scores and lower latency than cascaded baselines. Experiments on MC-FLEURS and RealConv show that JSTAR benefits from both supervised and unsupervised data, with reduced latency and improved translation quality. The work also shows that streaming MT with a transducer can approach transformer-level performance while enabling online translation, and that pre-training JSTAR components yields additional gains.

Abstract

We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is transducer-based and uses a multi-objective training strategy that optimizes both ASR and ST objectives simultaneously. This allows JSTAR to produce high-quality streaming ASR and ST results. We apply JSTAR in a bilingual conversational speech setting with smart-glasses, where the model is also trained to distinguish speech from different directions corresponding to the wearer and a conversational partner. Different model pre-training strategies are studied to further improve results, including training a transducer-based streaming machine translation (MT) model for the first time and using it to initialize the parameters of JSTAR. We demonstrate superior performance of JSTAR compared to a strong cascaded ST model in both BLEU scores and latency.
