AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation

Sara Papi; Marco Turchi; Matteo Negri

TL;DR

The paper tackles the latency-accuracy trade-off in simultaneous speech translation (SimulST) by proposing AlignAtt, a policy that leverages audio-translation alignments derived from the cross-attention of an offline-trained ST model. For each candidate token $\mathbf{y}_i$, AlignAtt computes the aligned frame as $Align_i = \arg\max_j A_C(\mathbf{x}, \mathbf{y}_i)$, where $A_C$ is the cross-attention between the audio input $\mathbf{x}$ and the token, and $j$ indexes the input frames. A token is emitted only when its alignment falls outside the most recent $f$ input frames; otherwise emission stops and the model waits for more audio. Evaluated on MuST-C v1.0 across 8 language directions, AlignAtt achieves gains of about 2 BLEU points with latency reductions of $0.5$–$0.8$ seconds compared with prior SimulST policies, and attains near-offline quality while keeping latency competitive. The results establish AlignAtt as a strong, training-free policy for simultaneous translation with offline-trained models, with open-source code and models to support reproducibility and practical deployment.
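The emission rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released implementation: the function name `alignatt_emit_count` is hypothetical, and it assumes the cross-attention has already been reduced to a single matrix with one row per candidate token and one column per audio frame received so far (how heads or layers are combined is not specified in this summary).

```python
from typing import List


def alignatt_emit_count(cross_attention: List[List[float]], f: int) -> int:
    """Return how many leading candidate tokens may be emitted.

    cross_attention: one row per candidate token y_i, one column per audio
        frame received so far (assumed already reduced to a 2D matrix).
    f: number of most recent frames treated as too recent to rely on.
    """
    emitted = 0
    for row in cross_attention:
        num_frames = len(row)
        # Align_i = argmax_j A_C(x, y_i): index of the most attended frame.
        align_i = max(range(num_frames), key=row.__getitem__)
        # Stop as soon as a token is aligned to one of the last f frames;
        # later tokens are held back until more audio arrives.
        if align_i >= num_frames - f:
            break
        emitted += 1
    return emitted


# Toy example: 4 candidate tokens, 5 audio frames, f = 2.
attention = [
    [0.70, 0.10, 0.10, 0.05, 0.05],  # aligned to frame 0 -> emit
    [0.10, 0.60, 0.10, 0.10, 0.10],  # aligned to frame 1 -> emit
    [0.05, 0.05, 0.10, 0.20, 0.60],  # aligned to frame 4 (last 2) -> stop
    [0.80, 0.05, 0.05, 0.05, 0.05],  # never reached
]
print(alignatt_emit_count(attention, f=2))
```

In this toy case the first two tokens are emitted and the third stops emission because it points at one of the last two frames. A larger `f` makes the policy more conservative, so more audio is consumed before tokens are released, while a smaller `f` lowers latency.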

Abstract

Attention is the core mechanism of today's most widely used architectures for natural language processing and has been analyzed from many perspectives, including its effectiveness for machine translation-related tasks. Among these studies, attention has proven to be a useful source of information for gaining insights into word alignment, even when the input text is replaced by audio segments, as in the speech translation (ST) task. In this paper, we propose AlignAtt, a novel policy for simultaneous ST (SimulST) that exploits the attention information to generate source-target alignments that guide the model during inference. Through experiments on the 8 language pairs of MuST-C v1.0, we show that AlignAtt outperforms previous state-of-the-art SimulST policies applied to offline-trained models, with BLEU gains of 2 points and latency reductions ranging from 0.5s to 0.8s across the 8 languages.

Paper Structure (11 sections, 3 equations, 2 figures, 2 tables, 1 algorithm)

Introduction
AlignAtt policy
Experimental Settings
Data
Architecture and Training Setup
Terms of Comparison
Inference and Evaluation
Results
Offline Results
Simultaneous Results
Conclusions

Figures (2)

Figure 1: Example of the AlignAtt policy with $f=2$ at consecutive time steps $t_1$ (a) and $t_2$ (b).
Figure 2: LAAL-BLEU curves for all the 8 language pairs of MuST-C tst-COMMON. AlignAtt is compared to the SimulST policies presented in the Terms of Comparison section. Latency (LAAL) is computationally aware and expressed in seconds ($s$).
