Streaming Sequence Transduction through Dynamic Compression

Weiting Tan; Yunmo Chen; Tongfei Chen; Guanghui Qin; Haoran Xu; Heidi C. Zhang; Benjamin Van Durme; Philipp Koehn

TL;DR

STAR (Stream Transduction with Anchor Representations) is a Transformer-based framework that dynamically segments an input stream and compresses each segment into an anchor representation for efficient streaming sequence-to-sequence transduction. It learns a segmenter from cross-attention feedback and compresses by selection, which yields substantial memory savings and better latency-quality trade-offs in simultaneous speech recognition (ASR) and speech translation (ST), outperforming CNN and CIF baselines across compression rates and datasets. The approach balances latency, memory footprint, and output quality, and is robust to inference-time changes in compression rate and segmentation policy, with gains in both non-streaming and streaming settings.

Abstract

We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR) and outperforming existing methods. Moreover, STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.
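
To make the idea of compressing a segment into an anchor concrete, here is a minimal sketch of selection-based compression. It assumes the encoder has already produced one vector per input frame and that a segmenter has marked the positions where yield is triggered; the function names, the toy encodings, and the anchor positions are hypothetical and are not taken from the paper's implementation.

```python
from typing import List, Sequence


def select_anchors(
    encodings: Sequence[Sequence[float]],
    yield_flags: Sequence[bool],
) -> List[List[float]]:
    """Keep only the encoder outputs at positions where yield is triggered."""
    if len(encodings) != len(yield_flags):
        raise ValueError("expected exactly one yield flag per encoded frame")
    return [list(vec) for vec, flag in zip(encodings, yield_flags) if flag]


def compression_rate(num_frames: int, num_anchors: int) -> float:
    """Ratio of input frames to retained anchor vectors."""
    if num_anchors == 0:
        raise ValueError("at least one anchor is required")
    return num_frames / num_anchors


# Toy input (hypothetical): 12 frames of 2-dimensional encodings,
# with yield triggered at frames 3, 7, and 11.
encodings = [[float(t), float(t) * 0.5] for t in range(12)]
yield_flags = [t in (3, 7, 11) for t in range(12)]

anchors = select_anchors(encodings, yield_flags)
rate = compression_rate(len(encodings), len(anchors))
```

In this toy case three of twelve frames are kept, so the decoder would attend over a sequence four times shorter than the encoder output. The sketch only shows the selection step; it does not model how the segmenter is trained.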

Paper Structure (36 sections, 11 equations, 13 figures, 3 tables, 1 algorithm)

Introduction
Methodology
Problem Formulation
Segmentation with Dynamic Compression
Learning Segmenter with Cross-attention
Compression with Anchor Representation
Model Training
Experiments: Non-Streaming Compression
Datasets and Evaluation Metrics
Training Setup
Compression with Anchor Representations
Baseline: CNN
Baseline: CIF
Results of Different Compression Methods
Streaming Experiments: Simultaneous Speech Recognition and Translation
...and 21 more sections

Figures (13)

Figure 1: When yield is triggered, the current segment's information is compressed into an anchor representation to generate the next output.
Figure 2: Visualization for the training of the segmenter through feedback from the encoder-decoder's cross-attention.
Figure 3: Visualization for the proposed "selection as compression" method. Input features are transformed by the encoder and we only select the encoding at the anchor position (where yield is triggered) as the compressed representation.
Figure 4: ASR performance (evaluated by WER) by different compression methods. STAR outperforms the other compressors, and the gap widens as the compression rate increases.
Figure 5: Latency-Quality trade-off for CIF and STAR. The five markers on the line correspond to different wait-$k$ strategies (from left to right, $k \in \{1,2,3,4,5\}$).
...and 8 more figures
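
The wait-$k$ strategies mentioned in the Figure 5 caption can be illustrated with the standard wait-$k$ read/write schedule. The sketch below assumes the source is counted in segments (anchors) and that each later target token waits for one additional segment; whether STAR applies the schedule in exactly this form is an assumption, and the function name is hypothetical.

```python
from typing import List


def wait_k_schedule(k: int, num_source_units: int, num_target_tokens: int) -> List[int]:
    """Number of source units that must be read before each target token is written.

    Standard wait-k: the first token waits for k units, and every later
    token waits for one more, capped at the source length.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [min(k + i, num_source_units) for i in range(num_target_tokens)]


# Hypothetical example: 5 source segments, 6 target tokens.
schedule_k1 = wait_k_schedule(1, 5, 6)
schedule_k2 = wait_k_schedule(2, 5, 6)
```

A larger $k$ delays every output by more source context, which generally trades higher latency for better quality; this is the axis along which the five markers in the figure move.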
