SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

TL;DR

ASR models rely on self-attention, which incurs quadratic time and memory costs in sequence length. This paper introduces SummaryMixing, a linear-time block that computes a global summary vector $\bar{\boldsymbol{s}} = \frac{1}{T}\sum_{t=1}^{T} s(\boldsymbol{x}_{t})$ and fuses it with local features via $\boldsymbol{h}_{t} = c(f(\boldsymbol{x}_{t}), \bar{\boldsymbol{s}})$, enabling $O(T)$ complexity. By replacing the MHSA component in Branchformer and Conformer with SummaryMixing, the authors demonstrate comparable or superior WER across five datasets and SLU/KWS tasks, while achieving up to 28% faster training and halving peak VRAM usage. The approach extends to diverse languages and acoustic conditions, indicating that a global utterance summary can suffice for effective speech encoding and opens the door to scalable, low-resource ASR/SLU systems. The results suggest SummaryMixing as a practical, generalizable alternative to self-attention in speech processing backbones, with potential for broader deployment.
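
The linear-versus-quadratic claim can be made concrete with a rough cost accounting. This is an illustrative sketch rather than the paper's own analysis: the feature dimension $d$ is a symbol introduced here, and $s$, $f$ and $c$ are assumed to be dense per-frame maps costing $O(d^{2})$ each.

```latex
\begin{align*}
\text{Self-attention:} \quad
  & \boldsymbol{Q}\boldsymbol{K}^{\top} \in \mathbb{R}^{T \times T}
  && \Rightarrow\; O(T^{2} d) \text{ time}, \quad O(T^{2}) \text{ memory per head} \\
\text{SummaryMixing:} \quad
  & \bar{\boldsymbol{s}} = \frac{1}{T}\sum_{t=1}^{T} s(\boldsymbol{x}_{t})
  && \Rightarrow\; O(T d^{2}) \text{ time}, \quad O(d) \text{ memory for the summary} \\
  & \boldsymbol{h}_{t} = c\big(f(\boldsymbol{x}_{t}), \bar{\boldsymbol{s}}\big), \quad t = 1, \dots, T
  && \Rightarrow\; O(T d^{2}) \text{ time}, \quad O(T d) \text{ memory}
\end{align*}
```

Under these assumptions, doubling the utterance length $T$ doubles the SummaryMixing cost, whereas the attention score matrix grows fourfold.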

Abstract

Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference and training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. It summarises an utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method "SummaryMixing". Introducing SummaryMixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while making training and inference up to 28% faster and reducing memory use by half.
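
The mechanism described above (average a transformed copy of every frame, then combine that single summary with each frame's own features) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names `make_layer` and `apply_layer`, the single affine-plus-ReLU layers standing in for the summary, local and combiner functions, the concatenation used to combine them, and all dimensions are assumptions made for the example.

```python
import random


def make_layer(n_in, n_out, rng):
    """Random weights for a hypothetical single-layer stand-in network."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


def apply_layer(layer, x):
    """Affine map followed by ReLU, applied to one frame vector."""
    weight, bias = layer
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


def summary_mixing(frames, f_layer, s_layer, c_layer):
    """Combine each frame's local features with one utterance-level summary."""
    num_frames = len(frames)
    # Summary branch: transform every frame, then average over time.
    s_out = [apply_layer(s_layer, x) for x in frames]
    s_bar = [sum(column) / num_frames for column in zip(*s_out)]
    # Local branch sees only the current frame; the combiner sees the local
    # features concatenated with the shared summary (concatenation is an
    # assumption of this sketch).
    return [apply_layer(c_layer, apply_layer(f_layer, x) + s_bar) for x in frames]


rng = random.Random(0)
d_in, d_hid, d_out, T = 4, 8, 4, 5  # illustrative sizes only
f_layer = make_layer(d_in, d_hid, rng)
s_layer = make_layer(d_in, d_hid, rng)
c_layer = make_layer(2 * d_hid, d_out, rng)
frames = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(T)]
outputs = summary_mixing(frames, f_layer, s_layer, c_layer)
```

Each frame is visited a fixed number of times and no frame-by-frame score matrix is ever built, so the work grows linearly with the number of frames. Because the summary is a plain mean, it does not depend on frame order; only the local branch carries time-specific information.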

Paper Structure (13 sections, 2 equations, 2 figures, 2 tables)

This paper contains 13 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Comparison of the self-attention cell (left) and the newly proposed SummaryMixing cell (right). In SummaryMixing, the information from all time steps is averaged, and this average is fed back to each time step $t$.
  • Figure 2: Efficiency measurements and real-time factor analysis. The left-most and right-most curves show, respectively, the average time and the peak VRAM consumption needed to process sequences of various lengths. The curve in the middle shows the real-time factor (RTF) of trained ASR systems.