Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation

Keqi Deng, Philip C. Woodland

TL;DR

The paper tackles end-to-end simultaneous speech translation (SST) with a label-synchronous neural transducer that supports both streaming and re-ordering. It introduces the Auto-regressive Integrate-and-Fire (AIF) mechanism and a latency-controllable variant to trade quality against latency, along with chunk-based incremental joint decoding to refine the search without adding latency. The LS-Transducer-SST uses an LM-like target-side prediction network and a target-side CTC branch to improve alignment and to exploit monolingual text data, which addresses data sparsity in SST. Experiments on Fisher-CallHome Spanish (Es-En) and MuST-C En-De show better quality-latency trade-offs than strong baselines, and cross-domain tests confirm robustness and adaptability, making it a competitive SST approach for low-to-medium latency applications.

Abstract

While the neural transducer is popular for online speech recognition, simultaneous speech translation (SST) requires both streaming and re-ordering capabilities. This paper presents the LS-Transducer-SST, a label-synchronous neural transducer for SST, which naturally possesses these two properties. The LS-Transducer-SST dynamically decides when to emit translation tokens based on an Auto-regressive Integrate-and-Fire (AIF) mechanism. A latency-controllable AIF is also proposed, which can control the quality-latency trade-off either during decoding only or in both decoding and training. The LS-Transducer-SST can naturally utilise monolingual text-only data via its prediction network, which helps alleviate the key issue of data sparsity for E2E SST. During decoding, a chunk-based incremental joint decoding technique is designed to refine and expand the search space. Experiments on the Fisher-CallHome Spanish (Es-En) and MuST-C En-De data show that the LS-Transducer-SST gives a better quality-latency trade-off than existing popular methods. For example, the LS-Transducer-SST gives a 3.1/2.9 point BLEU increase (Es-En/En-De) relative to CAAT at a similar latency and a 1.4 s reduction in average lagging latency with similar BLEU scores relative to Wait-k.

Paper Structure

This paper contains 38 sections, 5 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Illustration of the proposed LS-Transducer-SST. Linear denotes a linear classifier. Target-side CTC uses translations in the training objective computation.
  • Figure 2: Illustration of latency-controllable AIF. $t$ denotes the time step. $\alpha$ is the frame-level weight. The black solid line shows when the tokens are emitted under standard AIF; the red dotted line illustrates the case when the AIF decision threshold is increased by 1.
  • Figure 3: Illustration of the proposed chunk-based incremental joint decoding. (a) an illustration of the chunk-based mask; (b) an example of the chunk-based incremental pruning according to the accumulated AIF weights $\sum \alpha$, in which the chunk size is 7, the beam size is 2 within a chunk, the decision threshold of the $i$-th output $y^{(i)}$ is $i$.
  • Figure 4: Quality-latency trade-off of LS-Transducer-SST on MuST-C En-De tst-COMMON set. The 5 dots for the latency-controllable AIF are $\epsilon \in \{0, 1, 3, 5, 7\}$.
  • Figure 5: Quality-latency trade-off curves on MuST-C En-De tst-COMMON set. Solid lines are comparable with technique results from the literature. Dotted line indicates the use of wav2vec2.0. All results use sequence-level KD in training.
  • ...and 5 more figures
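
The emission rule implied by the captions of Figures 2 and 3 can be illustrated with a small sketch. It assumes that the $i$-th output token is emitted at the first frame where the accumulated weight $\sum \alpha$ reaches $i$ (the decision threshold given in the Figure 3 caption), and that the latency-controllable variant raises every threshold by a constant offset. Treating $\epsilon$ from Figure 4 as that offset is an assumption made here, as are the function name `aif_emission_frames`, the toy weights, and the choice to return `None` when a threshold is never reached. This is not the authors' implementation.

```python
from itertools import accumulate


def aif_emission_frames(alphas, num_tokens, epsilon=0.0):
    """Hypothetical helper: frame index (0-based) at which each token fires.

    Token i (1-based) fires at the first frame t where the running sum of
    frame-level weights reaches i + epsilon. None means the threshold was
    not reached within the available frames.
    """
    cumulative = list(accumulate(alphas))
    frames = []
    t = 0
    for i in range(1, num_tokens + 1):
        threshold = i + epsilon  # assumed form of the raised threshold
        # Advance until the accumulated weight reaches the threshold.
        while t < len(cumulative) and cumulative[t] < threshold:
            t += 1
        frames.append(t if t < len(cumulative) else None)
    return frames


# Toy frame-level weights (chosen to be exact in floating point).
alphas = [0.5] * 8

print(aif_emission_frames(alphas, num_tokens=3))             # [1, 3, 5]
print(aif_emission_frames(alphas, num_tokens=3, epsilon=1))  # [3, 5, 7]
```

In this toy case, raising the threshold by 1 delays every emission by two frames, which matches the qualitative picture in Figure 2: a larger threshold lets the model see more speech before committing to a token, at the cost of higher latency.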