Overcoming Non-monotonicity in Transducer-based Streaming Generation

Zhengrui Ma, Yang Feng, Min Zhang

TL;DR

This work tackles non-monotonic alignments in Transducer-based streaming generation by introducing MonoAttn-Transducer, which learns a monotonic cross-attention from posterior alignments inferred with the forward-backward algorithm on the 2D alignment lattice. The predictor attends only to the encoder history observed so far, and the context vectors $c_u$ are computed efficiently from the posterior alignment weights and the attention energies. Training uses either a posterior alignment (from forward-backward) or a prior alignment (diagonal or uniform), which avoids enumerating all alignments, and a chunk-synchronization mechanism keeps training consistent with streaming inference. Experiments on speech-to-text and speech-to-speech simultaneous translation show consistent quality gains at latency comparable to or better than the baselines, especially on inputs with higher non-monotonicity.
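The forward-backward step on the alignment lattice can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a standard Transducer lattice with `T` input steps and `U` target tokens, and the function name `emission_posterior` and the inputs `log_blank` and `log_emit` are hypothetical. It returns, for each target token, the posterior probability of the input step at which that token is emitted.

```python
import numpy as np

def emission_posterior(log_blank, log_emit):
    """Posterior over the input step at which each target token is emitted.

    log_blank: (T, U+1) log-prob of blank at node (t, u), moving to (t+1, u).
    log_emit:  (T, U)   log-prob of emitting token u+1 at node (t, u),
                        moving to (t, u+1).
    Returns (posterior of shape (U, T), sequence log-likelihood).
    """
    T, U = log_emit.shape
    assert log_blank.shape == (T, U + 1)

    # Forward pass: alpha[t, u] = log-prob of reaching node (t, u).
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t - 1, u] + log_blank[t - 1, u])
            if u > 0:
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t, u - 1] + log_emit[t, u - 1])

    # Backward pass: beta[t, u] = log-prob of finishing from node (t, u).
    beta = np.full((T, U + 1), -np.inf)
    beta[T - 1, U] = log_blank[T - 1, U]  # the final blank ends the path
    for t in range(T - 1, -1, -1):
        for u in range(U, -1, -1):
            if t < T - 1:
                beta[t, u] = np.logaddexp(
                    beta[t, u], log_blank[t, u] + beta[t + 1, u])
            if u < U:
                beta[t, u] = np.logaddexp(
                    beta[t, u], log_emit[t, u] + beta[t, u + 1])

    log_z = beta[0, 0]  # total log-likelihood over all alignments
    # Every path crosses the edge (t, u) -> (t, u+1) for exactly one t,
    # so each row of the posterior sums to one.
    log_post = alpha[:, :U] + log_emit + beta[:, 1:] - log_z
    return np.exp(log_post).T, log_z
```

The two passes cost O(T·U), which is what makes it unnecessary to enumerate individual alignment paths. In a chunked streaming setup, the time axis would presumably index chunks rather than frames; that detail is an assumption here.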

Abstract

Streaming generation models are used across many fields, with the Transducer architecture being popular in industrial applications. However, its input-synchronous decoding mechanism presents challenges in tasks requiring non-monotonic alignments, such as simultaneous translation. In this research, we address this issue by integrating Transducer's decoding with the history of the input stream via a learnable monotonic attention. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps, which is then used to estimate the monotonic context representations, thereby avoiding the need to enumerate the exponentially large alignment space during training. Extensive experiments show that our MonoAttn-Transducer effectively handles non-monotonic alignments in streaming scenarios, offering a robust solution for complex generation tasks.
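To make the step from alignment posterior to context representation concrete, the following NumPy sketch computes an expected context vector for each predictor state. It assumes one plausible reading of the description above: given that state `u` is aligned with input step `t`, attention is a softmax over encoder states `0..t`, and the context is the expectation of that attention output under the posterior. The function names and the inputs `post`, `energy`, and `enc` are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def expected_context_naive(post, energy, enc):
    """Reference version: loop over every (state, input step) pair.

    post:   (U, T) alignment posterior, each row sums to one.
    energy: (U, T) attention energies between predictor and encoder states.
    enc:    (T, D) encoder states.
    """
    U, T = post.shape
    ctx = np.zeros((U, enc.shape[1]))
    for u in range(U):
        for t in range(T):
            e = energy[u, : t + 1]          # only the history up to step t
            attn = np.exp(e - e.max())
            attn /= attn.sum()
            ctx[u] += post[u, t] * (attn @ enc[: t + 1])
    return ctx

def expected_context_fast(post, energy, enc):
    """Same quantity with cumulative sums instead of nested softmaxes."""
    # Row-wise shift for stability; adequate for moderate energy ranges.
    expo = np.exp(energy - energy.max(axis=1, keepdims=True))
    prefix = np.cumsum(expo, axis=1)        # softmax normalizer for history 0..t
    ratio = post / prefix
    # tail[u, s] = sum over t >= s of post[u, t] / prefix[u, t]
    tail = np.cumsum(ratio[:, ::-1], axis=1)[:, ::-1]
    weights = expo * tail                   # effective attention weights, (U, T)
    return weights @ enc
```

The fast version swaps the order of summation so that each encoder state receives a single effective weight, which avoids recomputing a softmax for every candidate alignment position.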

Paper Structure

This paper contains 25 sections, 13 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: An example of diagonal prior and posterior alignment from the MuST-C English-to-Spanish training corpus. The vertical axis represents the target subword sequence and the horizontal axis represents the speech waveform. Darker areas indicate higher alignment probabilities. Chunk size in this example is set to 640ms. More examples are provided in the appendix.
  • Figure 2: (a), (b): Results of translation quality (BLEU) against latency (Average Lagging, AL) on MuST-C English to German and English to Spanish datasets. (c): Performance on MuST-C English to Spanish test subsets categorized by non-monotonicity. In the figures above, MA-T denotes MonoAttn-Transducer.
  • Figure 3: Chunk size in this example is set to 320ms. (Diagonal Prior)
  • Figure 4: Chunk size in this example is set to 640ms. (Diagonal Prior)
  • Figure 5: Chunk size in this example is set to 960ms. (Diagonal Prior)
  • ...and 1 more figure