AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation
Sara Papi, Marco Turchi, Matteo Negri
TL;DR
The paper tackles the latency-accuracy trade-off in simultaneous speech translation by proposing AlignAtt, a policy that leverages audio-translation alignments derived from the cross-attention of an offline-trained ST model. AlignAtt aligns each predicted token to the audio frame with the highest cross-attention score, $Align_i = \arg\max_j A_C(\mathbf{x}, \mathbf{y}_i)$, and emits a token only when its alignment falls outside the most recent $f$ input frames; otherwise emission stops and the model waits for more audio. Evaluated on MuST-C v1.0 across 8 language directions, AlignAtt achieves gains of about 2 BLEU points and latency reductions of $0.5$–$0.8$ seconds compared with prior SimulST policies, approaching offline quality while keeping latency competitive. The results establish AlignAtt as a strong policy for simultaneous translation that works on offline-trained models without additional training, with open-source code and models to support reproducibility and practical deployment.
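The emission rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `count_emittable_tokens` is hypothetical, and the sketch assumes the cross-attention scores have already been reduced to a single matrix with one row per predicted token and one column per received audio frame (how heads and layers are combined is not specified here).

```python
import numpy as np


def count_emittable_tokens(cross_attention: np.ndarray, f: int) -> int:
    """Return how many leading predicted tokens the policy would emit.

    cross_attention: array of shape (num_tokens, num_frames); row i holds
        the attention of predicted token y_i over the audio frames received
        so far (assumed to be already aggregated into one matrix).
    f: number of most recent frames treated as "too recent" to rely on.
    """
    num_tokens, num_frames = cross_attention.shape
    emitted = 0
    for i in range(num_tokens):
        # Align_i = argmax_j A_C(x, y_i): the frame token i attends to most.
        align_i = int(np.argmax(cross_attention[i]))
        # If the token is aligned to one of the last f frames, stop and
        # wait for more audio; this token and all later ones are held back.
        if align_i >= num_frames - f:
            break
        emitted += 1
    return emitted


# Toy example: 3 predicted tokens, 10 audio frames received so far.
attention = np.zeros((3, 10))
attention[0, 2] = 1.0  # token 0 aligned to frame 2
attention[1, 5] = 1.0  # token 1 aligned to frame 5
attention[2, 9] = 1.0  # token 2 aligned to the latest frame
print(count_emittable_tokens(attention, f=2))
```

In this toy case the first two tokens are aligned to older audio and would be emitted, while the third points at the most recent frames and is held back. Larger values of $f$ make the policy more conservative, which trades higher latency for more stable output.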
Abstract
Attention is the core mechanism of today's most used architectures for natural language processing and has been analyzed from many perspectives, including its effectiveness for machine translation-related tasks. Among these studies, attention has proven to be a useful source of information about word alignment, even when the input text is replaced with audio segments, as in the speech translation (ST) task. In this paper, we propose AlignAtt, a novel policy for simultaneous ST (SimulST) that exploits the attention information to generate source-target alignments that guide the model during inference. Through experiments on the 8 language pairs of MuST-C v1.0, we show that AlignAtt outperforms previous state-of-the-art SimulST policies applied to offline-trained models, with BLEU gains of 2 points and latency reductions ranging from 0.5s to 0.8s across the 8 languages.
