Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition

Titouan Parcollet; Rogier van Dalen; Shucong Zhang; Sourav Bhattacharya

TL;DR

The paper tackles the $O(T^2)$ complexity of self-attention in ASR encoders, where $T$ is the utterance length, by extending SummaryMixing to a streaming Conformer Transducer. Combining SummaryMixing with Dynamic Chunk Training and Dynamic Chunk Convolution gives a linear-time encoder that operates in both streaming and offline modes without changing the architecture. The approach yields word error rates (WER) that match or improve on multi-head self-attention (MHSA) on LibriSpeech and VoxPopuli, while training faster and with substantially lower peak memory; decoding remains efficient even with an unbounded left context. The work provides an open-source recipe in SpeechBrain and demonstrates practical benefits for on-device streaming ASR and for future linear-time encoder designs.

Abstract

Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition that, for the first time, preserves or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.
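To make the linear-time idea concrete, here is a minimal NumPy sketch of a SummaryMixing-style layer. It is an illustration under stated assumptions, not the authors' implementation: the split into a per-frame local branch, a per-frame summary branch averaged over time, and a combiner follows a common reading of the original SummaryMixing formulation, which this page does not spell out. The single linear maps with `tanh`, the weight names `w_f`, `w_s`, `w_c`, and the `chunk_size` argument are all hypothetical. The streaming branch assumes that each frame may see every past frame plus the frames of its own chunk, which is one plausible way to obtain an unbounded left context with chunked decoding.

```python
import numpy as np


def summary_mixing(x, w_f, w_s, w_c, chunk_size=None):
    """Illustrative SummaryMixing-style layer (assumed form, not the paper's code).

    x: (T, D) frame features. w_f, w_s: (D, H). w_c: (2 * H, D_out).
    chunk_size=None gives the offline variant; an integer gives a
    chunk-causal streaming variant. Both cost O(T) in the number of frames.
    """
    num_frames = x.shape[0]
    local = np.tanh(x @ w_f)    # per-frame local transformation
    contrib = np.tanh(x @ w_s)  # per-frame contribution to the summary

    if chunk_size is None:
        # Offline: one mean over the whole utterance, shared by every frame.
        summary = np.broadcast_to(contrib.mean(axis=0), contrib.shape)
    else:
        # Streaming (assumption): frame t averages all frames from the start
        # of the utterance up to the end of the chunk that contains t.
        running_sum = np.cumsum(contrib, axis=0)
        chunk_end = np.minimum(
            (np.arange(num_frames) // chunk_size + 1) * chunk_size, num_frames
        )  # exclusive end index of each frame's chunk
        summary = running_sum[chunk_end - 1] / chunk_end[:, None]

    # Combine the local and summary branches for each frame.
    return np.tanh(np.concatenate([local, summary], axis=1) @ w_c)


# Example call with random weights (all shapes are arbitrary choices).
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 4))
w_f, w_s = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))
w_c = rng.normal(size=(12, 4))
offline_out = summary_mixing(frames, w_f, w_s, w_c)
streaming_out = summary_mixing(frames, w_f, w_s, w_c, chunk_size=4)
```

Because the summary is a mean rather than a pairwise comparison of frames, no $T \times T$ matrix is ever formed, and in the streaming case the running sum is the only state that needs to be carried from one chunk to the next.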

Paper Structure

This paper contains 9 sections, 3 equations, 2 figures, and 1 table.

Introduction
Streaming and Non-Streaming SummaryMixing
Dynamic Chunk Convolutions and Training
Streaming SummaryMixing
Experiments
Experimental protocol
Speech recognition results
Extended analysis of streaming SummaryMixing
Conclusion

Figures (2)

Figure 1: The Conformer with non-streaming (left) and streaming (right) SummaryMixing.
Figure 2: Word error rate of the MHSA- and SummaryMixing-equipped models trained on VoxPopuli as a function of utterance length (left plot). Sentences are taken from the VoxPopuli test set. The middle plot shows the inference real-time factor measured on CPU and GPU for the two models. The right-most plot gives the peak VRAM of both models when decoding. The left context is infinite for all experiments.
