SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding
Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya
TL;DR
ASR models rely on self-attention, which incurs quadratic time and memory costs in sequence length. This paper introduces SummaryMixing, a linear-time block that computes a global summary vector $\bar{\boldsymbol{s}} = \frac{1}{T}\sum_{t=1}^{T} s(\boldsymbol{x}_{t})$ and fuses it with local features via $\boldsymbol{h}_{t} = c(f(\boldsymbol{x}_{t}), \bar{\boldsymbol{s}})$, enabling $O(T)$ complexity. By replacing the MHSA component in Branchformer and Conformer with SummaryMixing, the authors demonstrate comparable or superior WER across five datasets and SLU/KWS tasks, while achieving up to 28% faster training and halving peak VRAM usage. The approach extends to diverse languages and acoustic conditions, indicating that a global utterance summary can suffice for effective speech encoding and opens the door to scalable, low-resource ASR/SLU systems. The results suggest SummaryMixing as a practical, generalizable alternative to self-attention in speech processing backbones, with potential for broader deployment.
Abstract
Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference and training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. It summarises an utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method "SummaryMixing". Introducing SummaryMixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while making training and inference up to 28% faster and reducing memory use by half.
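The mechanism described above (a mean over per-frame vectors, combined with time-specific information) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `summary_mixing`, the weight matrices `w_f`, `w_s`, `w_c`, the choice of a single linear layer with ReLU for the local and summary transforms, and the use of concatenation followed by a linear map as the combiner are all assumptions made for illustration. The point is only that every step costs time linear in the number of frames T.

```python
import numpy as np


def summary_mixing(x, w_f, w_s, w_c):
    """Illustrative linear-time mixing of an utterance.

    x:   (T, d_in)  input frames
    w_f: (d_in, d_h)  weights of the local transform (assumed: linear + ReLU)
    w_s: (d_in, d_h)  weights of the summary transform (assumed: linear + ReLU)
    w_c: (2 * d_h, d_out)  weights of the combiner (assumed: concat + linear)
    Returns an array of shape (T, d_out).
    """
    # Time-specific information: one vector per frame, O(T).
    local = np.maximum(x @ w_f, 0.0)                    # (T, d_h)

    # Global summary: mean over all time steps, O(T).
    summary = np.maximum(x @ w_s, 0.0).mean(axis=0)     # (d_h,)

    # Give every frame the same summary vector, then combine.
    tiled = np.broadcast_to(summary, local.shape)       # (T, d_h)
    combined = np.concatenate([local, tiled], axis=-1)  # (T, 2 * d_h)
    return combined @ w_c                               # (T, d_out)


# Example with hypothetical sizes.
rng = np.random.default_rng(0)
T, d_in, d_h, d_out = 50, 8, 16, 4
x = rng.standard_normal((T, d_in))
w_f = rng.standard_normal((d_in, d_h))
w_s = rng.standard_normal((d_in, d_h))
w_c = rng.standard_normal((2 * d_h, d_out))
h = summary_mixing(x, w_f, w_s, w_c)
print(h.shape)
```

Unlike self-attention, no T-by-T matrix of pairwise scores is ever formed: the only interaction between frames is through the single averaged vector, so both compute and memory grow linearly with utterance length in this sketch.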
