A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation

Zhengrui Ma; Qingkai Fang; Shaolei Zhang; Shoutao Guo; Yang Feng; Min Zhang

TL;DR

NAST-S2X is a novel non-autoregressive generation framework for simultaneous speech translation that integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. It achieves high-quality simultaneous interpretation with a delay of less than 3 seconds and provides a 28× decoding speedup in offline generation.

Abstract

Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2X outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
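
The abstract notes that the decoder may emit blank or repeated tokens for each fixed-length speech chunk and that CTC decoding turns these into the final output. The following is a minimal sketch of that collapsing step applied chunk by chunk, not the authors' implementation. The names `BLANK`, `ctc_collapse_chunk`, and `stream_decode`, as well as the integer token ids, are assumptions made for illustration. A chunk whose raw outputs are all blanks emits nothing, which is one way such a decoder can wait for more source speech.

```python
from typing import List, Optional, Tuple

BLANK = 0  # hypothetical id of the CTC blank token


def ctc_collapse_chunk(
    chunk_tokens: List[int], prev_token: Optional[int] = None
) -> Tuple[List[int], Optional[int]]:
    """Collapse the raw outputs of one chunk: merge repeats, then drop blanks.

    prev_token is the last raw token of the previous chunk, so a repeat that
    spans a chunk boundary is still merged.
    """
    emitted = []
    for tok in chunk_tokens:
        # Emit only when the token differs from its predecessor and is not blank.
        if tok != prev_token and tok != BLANK:
            emitted.append(tok)
        prev_token = tok
    return emitted, prev_token


def stream_decode(chunks: List[List[int]]) -> List[List[int]]:
    """Return the tokens emitted after each incoming chunk."""
    prev: Optional[int] = None
    outputs = []
    for chunk in chunks:
        emitted, prev = ctc_collapse_chunk(chunk, prev)
        outputs.append(emitted)
    return outputs


if __name__ == "__main__":
    # Three chunks of four raw decoder outputs each (made-up values).
    raw = [[0, 0, 0, 0], [5, 5, 0, 7], [7, 0, 7, 9]]
    for i, out in enumerate(stream_decode(raw)):
        print(f"chunk {i}: emitted {out}")
```

In this toy input, the first chunk produces no output, and the token 7 that starts the third chunk is merged with the 7 that ended the second chunk, while the later 7 is emitted again because a blank separates it.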

Paper Structure (34 sections, 9 equations, 4 figures, 16 tables)

This paper contains 34 sections, 9 equations, 4 figures, 16 tables.

Introduction
Preliminaries
Simultaneous Speech Translation
Simul-S2T
Simul-S2S
Speech-to-Unit Translation
Approach
Architecture
Streaming Acoustic Encoder
Streaming Non-autoregressive Decoder
Latency Control
Training
Multi-task Non-monotonic Training
Two-Step Glancing
Experiments
...and 19 more sections

Figures (4)

Figure 1: NAST-S2$x$ can perform both Simul-S2T and Simul-S2S tasks within a unified end-to-end framework. The model generates speech output directly without the need to produce intermediate target text tokens.
Figure 2: Overview of the proposed non-autoregressive generation framework for end-to-end simultaneous speech-to-any translation (NAST-S2$x$, $x \in \{\mathrm{text}, \mathrm{speech} \}$). Different colors indicate different chunks.
Figure 3: Results of translation quality (BLEU) against latency (Average Lagging, AL) on MuST-C En$\rightarrow$De and En$\rightarrow$Es datasets. The red solid line and dashed line illustrate the performance of NAST-S2$T$ under different chunk sizes $T_s$ or in an offline condition. The numerical results are presented in Table \ref{table:ende} and Table \ref{table:enes}.
Figure 4: Results of translation quality in offline conditions and simultaneous scenarios (ASR-BLEU or ASR-BLEU (Silence Removed) against AL or AL_EOW). The numerical results of NAST-S2$S$ are presented in Table \ref{table:fren} and Table \ref{table:fren_asr_bleu_rm}.
