Joint Training And Decoding for Multilingual End-to-End Simultaneous Speech Translation

Wuwei Huang, Renren Jin, Wen Zhang, Jian Luan, Bin Wang, Deyi Xiong

TL;DR

This work tackles the challenge of multilingual end-to-end simultaneous speech translation by proposing two joint-training architectures—separate decoders and a unified shared decoder—and introducing a joint asynchronous training strategy to promote cross-language transfer. It constructs a multilingual TED-based dataset (TED-MMST) for benchmarking and demonstrates that both architectures improve BLEU scores over bilingual baselines, with the unified model offering stronger gains and fewer parameters. Moreover, asynchronous training further boosts translation quality, likely by enhancing cross-language anticipation. The methods and dataset provide a practical path toward real-time, multilingual speech translation in settings like international conferences and online multilingual content.

Abstract

Recent studies on end-to-end speech translation (ST) have facilitated the exploration of multilingual end-to-end ST and end-to-end simultaneous ST. In this paper, we investigate end-to-end simultaneous speech translation in a one-to-many multilingual setting, which is closer to real-world applications. We explore a separate decoder architecture and a unified architecture for joint synchronous training in this scenario. To further explore knowledge transfer across languages, we propose an asynchronous training strategy on the proposed unified decoder architecture. A multi-way aligned multilingual end-to-end ST dataset was curated as a benchmark testbed to evaluate our methods. Experimental results demonstrate the effectiveness of our models on the collected dataset. Our code and data are available at: https://github.com/XiaoMi/TED-MMST.

Paper Structure

This paper contains 17 sections, 5 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Diagram of joint multilingual end-to-end simultaneous speech translation.