StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection
Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli
TL;DR
This work introduces StreamAtt, the first policy for direct streaming speech-to-text translation (StreamST), and StreamLAAL, a latency metric that enables fair comparison with SimulST. StreamAtt uses cross-attention to guide both Hypothesis Selection and Audio History Selection, and pairs them with a textual history heuristic (Fixed Words or Punctuation) to balance translation quality and latency on unsegmented audio streams. Across MuST-C v1.0’s eight language pairs, StreamAtt substantially outperforms a naive streaming baseline and, at low latency, is competitive with AlignAtt, the state-of-the-art SimulST policy used as an upper bound. The work also analyzes punctuation-based textual history, identifies its drawbacks, and outlines directions for future work on data augmentation and history management to narrow the remaining gap to SimulST.
Abstract
Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream. Unlike simultaneous ST (SimulST), which deals with pre-segmented speech, StreamST faces the challenges of handling continuous and unbounded audio streams. This requires additional decisions about what to retain of the previous history, which is impractical to keep entirely due to latency and computational constraints. Despite the real-world demand for real-time ST, research on streaming translation remains limited, with existing works solely focusing on SimulST. To fill this gap, we introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric designed to be comparable with existing metrics for SimulST. Extensive experiments across all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt compared to a naive streaming baseline and the related state-of-the-art SimulST policy, providing a first step in StreamST research.
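To make the history selection mentioned in the TL;DR more concrete, here is a minimal Python sketch of the general idea. It is not the authors' implementation. It assumes one token per word, a cross-attention matrix with one row per emitted token and one column per audio frame, and that each token is aligned to its most-attended frame. The function name `select_history` and the parameter `keep_words` are hypothetical. The Fixed Words heuristic is modeled here as keeping only the last `keep_words` words of the textual history; the audio history is then cut at the earliest frame aligned to any retained word.

```python
from typing import List, Sequence, Tuple


def select_history(
    tokens: Sequence[str],
    cross_attn: Sequence[Sequence[float]],
    keep_words: int,
) -> Tuple[List[str], int]:
    """Return (retained words, index of the first audio frame to keep).

    Assumptions (illustrative only, not taken from the paper):
    - tokens[i] is one word, and cross_attn[i] holds its attention
      weights over the audio frames currently in memory;
    - a word is aligned to the frame with its highest attention weight.
    """
    if len(tokens) != len(cross_attn):
        raise ValueError("need one attention row per token")
    if keep_words <= 0:
        raise ValueError("keep_words must be positive")
    if not tokens:
        # Nothing has been emitted yet, so keep all audio.
        return [], 0

    # Textual history: keep only the most recent words.
    start = max(0, len(tokens) - keep_words)
    retained = list(tokens[start:])

    # Audio history: find the most-attended frame of each retained word
    # and keep the audio from the earliest of those frames onward.
    aligned_frames = [
        max(range(len(row)), key=row.__getitem__)
        for row in cross_attn[start:]
    ]
    return retained, min(aligned_frames)


if __name__ == "__main__":
    words = ["the", "cat", "sat", "down"]
    attention = [
        [0.7, 0.2, 0.1, 0.0, 0.0, 0.0],
        [0.1, 0.6, 0.3, 0.0, 0.0, 0.0],
        [0.0, 0.1, 0.2, 0.6, 0.1, 0.0],
        [0.0, 0.0, 0.1, 0.2, 0.3, 0.4],
    ]
    kept_words, first_frame = select_history(words, attention, keep_words=2)
    audio_frames = list(range(6))  # stand-in for real audio features
    print(kept_words, audio_frames[first_frame:])
```

In this sketch the amount of retained audio follows the retained text instead of being set by a fixed time window, which is the trade-off the policy is meant to manage: a shorter textual history frees more audio but leaves the model with less context.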
