Streaming Sequence Transduction through Dynamic Compression
Weiting Tan, Yunmo Chen, Tongfei Chen, Guanghui Qin, Haoran Xu, Heidi C. Zhang, Benjamin Van Durme, Philipp Koehn
TL;DR
STAR (Stream Transduction with Anchor Representations) is a Transformer-based framework for streaming sequence-to-sequence transduction. It dynamically segments the input stream and compresses each segment into an anchor representation. The segmenter is learned from cross-attention signals, compression is selection-based, and the model is trained end to end. On simultaneous automatic speech recognition (ASR) and speech translation (ST) tasks, STAR yields substantial memory savings and better latency-quality trade-offs, outperforming CNN and CIF baselines across compression rates and datasets. It is also robust to changes in the compression rate at inference time and to different segmentation policies, with gains in both non-streaming and streaming settings.
Abstract
We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR) and outperforming existing methods. Moreover, STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.
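The idea of segmenting a stream and keeping one anchor per segment can be illustrated with a small sketch. This is not the authors' implementation: the function names `select_anchors` and `compression_rate`, the per-frame boundary scores, and the accumulate-until-threshold firing rule are all assumptions made for illustration. In STAR the segmenter is learned; here the scores are simply given as input, and the anchor is taken to be the frame at which a segment closes.

```python
from typing import List, Sequence, Tuple


def select_anchors(
    frames: Sequence[Sequence[float]],
    scores: Sequence[float],
    threshold: float = 1.0,
) -> Tuple[List[int], List[List[float]]]:
    """Segment a stream and keep one anchor frame per segment.

    Hypothetical rule (an assumption, not taken from the paper): scores are
    accumulated frame by frame, and a segment closes whenever the running
    sum reaches `threshold`. The frame that closes the segment is selected
    as its anchor; all other frames in the segment are dropped.
    """
    if len(frames) != len(scores):
        raise ValueError("frames and scores must have the same length")

    anchor_indices: List[int] = []
    accumulated = 0.0
    for i, score in enumerate(scores):
        accumulated += score
        if accumulated >= threshold:
            # Close the segment here; frame i stands in for the whole segment.
            anchor_indices.append(i)
            accumulated -= threshold

    anchors = [list(frames[i]) for i in anchor_indices]
    return anchor_indices, anchors


def compression_rate(num_frames: int, num_anchors: int) -> float:
    """Ratio of input frames to retained anchors (e.g. 12.0 means 12x)."""
    if num_anchors <= 0:
        raise ValueError("at least one anchor is required")
    return num_frames / num_anchors


if __name__ == "__main__":
    # Twelve toy 2-dimensional "encoder frames" with uniform scores.
    toy_frames = [[float(i), float(i) * 2.0] for i in range(12)]
    toy_scores = [0.25] * 12

    indices, anchors = select_anchors(toy_frames, toy_scores)
    print(indices)                                  # [3, 7, 11]
    print(compression_rate(len(toy_frames), len(indices)))  # 4.0
```

Because the loop only looks at frames seen so far, the same rule can run incrementally on a stream: each time a segment closes, its anchor can be handed to the decoder and the remaining frames of that segment can be discarded, which is where the memory saving in this toy setting comes from.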
