Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition
Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya
TL;DR
The paper tackles the $O(T^2)$ complexity of self-attention in ASR encoders by extending SummaryMixing to a streaming Conformer Transducer. Combining SummaryMixing with Dynamic Chunk Training and Dynamic Chunk Convolution yields a linear-time encoder that operates in both streaming and offline modes without changing the architecture. The approach achieves WER that matches or surpasses MHSA on LibriSpeech and VoxPopuli, while training faster and with substantially lower peak memory; decoding remains efficient even with an effectively infinite left context. The work provides an open-source recipe in SpeechBrain and demonstrates practical benefits for on-device streaming ASR and future linear-time encoder designs.
Abstract
Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition that, for the first time, matches or exceeds the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.
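To make the linear-time claim concrete, the following is a minimal sketch of the general idea and not the authors' implementation. It assumes that SummaryMixing replaces pairwise attention with a per-frame local transform plus a single averaged summary vector, and that the streaming variant averages over all frames up to the end of the current chunk. The names `summary_mixing_offline`, `summary_mixing_streaming`, `local_fn`, `summary_fn`, and `combine_fn` are hypothetical, and the toy functions stand in for what would be learned networks in a real encoder.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def summary_mixing_offline(
    frames: Sequence[Vector],
    local_fn: Callable[[Vector], Vector],
    summary_fn: Callable[[Vector], Vector],
    combine_fn: Callable[[Vector, Vector], Vector],
) -> List[Vector]:
    """Offline sketch: one summary averaged over the whole utterance.

    Every frame is visited a constant number of times, so the cost is O(T).
    """
    if not frames:
        return []
    summaries = [summary_fn(x) for x in frames]
    dim = len(summaries[0])
    mean = [sum(s[d] for s in summaries) / len(summaries) for d in range(dim)]
    return [combine_fn(local_fn(x), mean) for x in frames]


def summary_mixing_streaming(
    frames: Sequence[Vector],
    chunk_size: int,
    local_fn: Callable[[Vector], Vector],
    summary_fn: Callable[[Vector], Vector],
    combine_fn: Callable[[Vector, Vector], Vector],
) -> List[Vector]:
    """Streaming sketch (assumption): frames in a chunk share a summary
    averaged over all frames seen up to the end of that chunk.

    A running sum keeps the cost O(T) and the state O(1) in the number of
    past frames, even though the left context is unbounded.
    """
    outputs: List[Vector] = []
    running_sum: Vector = []
    seen = 0
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        for x in chunk:
            s = summary_fn(x)
            running_sum = list(s) if not running_sum else [
                a + b for a, b in zip(running_sum, s)
            ]
            seen += 1
        mean = [v / seen for v in running_sum]
        outputs.extend(combine_fn(local_fn(x), mean) for x in chunk)
    return outputs


# Toy stand-ins for the learned transforms (illustrative only).
def identity(v: Vector) -> Vector:
    return list(v)


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    toy = [[1.0], [3.0], [5.0], [7.0]]
    print(summary_mixing_offline(toy, identity, identity, add))
    # [[5.0], [7.0], [9.0], [11.0]]
    print(summary_mixing_streaming(toy, 2, identity, identity, add))
    # [[3.0], [5.0], [9.0], [11.0]]
```

In this sketch, setting `chunk_size` to the utterance length makes the streaming function reproduce the offline output, which mirrors the idea of one architecture serving both modes. Dynamic Chunk Training would correspond to varying `chunk_size` during training; that part is not modelled here.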
