StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection

Sara Papi; Marco Gaido; Matteo Negri; Luisa Bentivogli

TL;DR

This work introduces StreamAtt, the first policy for direct streaming speech-to-text translation (StreamST), and StreamLAAL, a latency metric that enables fair comparison with SimulST. StreamAtt uses cross-attention to guide Hypothesis Selection and Audio History Selection, and pairs them with a Textual History Selection heuristic (Fixed Words or Punctuation) to balance translation quality and latency on unsegmented audio streams. Across the eight language pairs of MuST-C v1.0, StreamAtt substantially outperforms a naive streaming baseline and, at low latency, is competitive with AlignAtt, the SimulST policy treated as an upper bound. The work also analyzes punctuation-based history, reveals its drawbacks, and outlines directions for future work on data augmentation and history management to further close the gap to SimulST.

Abstract

Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream. Unlike simultaneous ST (SimulST), which deals with pre-segmented speech, StreamST faces the challenges of handling continuous and unbounded audio streams. This requires additional decisions about what to retain of the previous history, which is impractical to keep entirely due to latency and computational constraints. Despite the real-world demand for real-time ST, research on streaming translation remains limited, with existing works solely focusing on SimulST. To fill this gap, we introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric designed to be comparable with existing metrics for SimulST. Extensive experiments across all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt compared to a naive streaming baseline and the related state-of-the-art SimulST policy, providing a first step in StreamST research.
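
To make the two Textual History Selection heuristics named in the TL;DR more concrete, the following is a minimal Python sketch of one plausible reading of each. It is not the authors' implementation: the function names, the `max_words` parameter, and the set of punctuation marks treated as sentence boundaries are all assumptions made for illustration, and the attention-based Audio History Selection step is not covered here.

```python
import re

# Assumption: a word ending in one of these marks closes a sentence.
STRONG_PUNCT = re.compile(r"[.!?]$")


def fixed_words_history(words, max_words):
    """Fixed Words (FW) sketch: keep only the last `max_words` words.

    `max_words` is a hypothetical parameter name, not taken from the paper.
    """
    if max_words <= 0:
        # Guard needed because words[-0:] would return the whole list.
        return []
    return list(words[-max_words:])


def punctuation_history(words):
    """Punctuation (P) sketch: keep only the words after the last
    word that ends with strong punctuation."""
    for i in range(len(words) - 1, -1, -1):
        if STRONG_PUNCT.search(words[i]):
            return list(words[i + 1:])
    # No sentence boundary found yet: keep everything.
    return list(words)


history = "Hello there. How are you".split()
print(fixed_words_history(history, 2))
print(punctuation_history(history))
```

In this sketch the FW variant bounds the history by length alone, while the P variant lets the retained text grow until a sentence boundary is emitted, which is one way the two heuristics could trade off context against latency.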

Paper Structure (23 sections, 8 equations, 2 figures, 6 tables)

Introduction
Related Works
Streaming Policy
Hypothesis Selection
History Selection
Textual History Selection
Fixed Number of Words (FW).
Punctuation (P).
Audio History Selection
Streaming Latency Metric
Experimental Settings
Data
Architecture and Training Setup
Inference, Evaluation, and Comparisons
Results
...and 8 more sections

Figures (2)

Figure 1: Decision steps of the StreamST policy. The order followed by our StreamAtt policy (step 1, step 2 for the textual history, and step 2 for the audio history) is indicated from 1 (first) to 3 (last).
Figure 2: Latency(LAAL/StreamLAAL$\downarrow$)-Quality(BLEU$\uparrow$) curves of AlignAtt and StreamAtt with Fixed Words (FW) and Punctuation (P) Textual History Selection for all the 8 language pairs of MuST-C v1.0 tst-COMMON.
