Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition

Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

TL;DR

The paper tackles the $O(T^2)$ complexity of self-attention in ASR encoders and extends SummaryMixing to a streaming Conformer Transducer. By combining SummaryMixing with Dynamic Chunk Training and Dynamic Chunk Convolution, it obtains a linear-time encoder that can operate in both streaming and offline modes without changing the architecture. The approach yields a WER that matches or surpasses MHSA on LibriSpeech and VoxPopuli, while training faster and using substantially less peak memory; decoding remains efficient even with an effectively infinite left context. The work provides an open-source recipe in SpeechBrain and demonstrates practical benefits for on-device streaming ASR and future linear-time encoder designs.

Abstract

Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition that, for the first time, preserves or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.
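
To make the linear-time claim concrete, the following is a minimal sketch of the summary step only, not the paper's implementation. It assumes the summary is a mean over per-frame features, as in the original SummaryMixing formulation, and treats `frames` as already transformed; the learned projections and the combination with local per-frame features are omitted. The names `offline_summary`, `streaming_summaries`, and `chunk_size` are hypothetical and chosen for illustration.

```python
from typing import List

Vector = List[float]


def offline_summary(frames: List[Vector]) -> Vector:
    """Mean of all frames: a single pass over T frames, so O(T)."""
    dim = len(frames[0])
    total = [0.0] * dim
    for frame in frames:
        for i, value in enumerate(frame):
            total[i] += value
    return [v / len(frames) for v in total]


def streaming_summaries(frames: List[Vector], chunk_size: int) -> List[Vector]:
    """One summary per chunk, averaging every frame seen so far.

    A running sum is carried across chunks (an assumption of this sketch),
    so the whole left context is covered without revisiting earlier frames.
    """
    dim = len(frames[0])
    running = [0.0] * dim
    seen = 0
    summaries = []
    for start in range(0, len(frames), chunk_size):
        # Fold the frames of the current chunk into the running sum.
        for frame in frames[start:start + chunk_size]:
            for i, value in enumerate(frame):
                running[i] += value
            seen += 1
        # Frames in this chunk would share this summary vector.
        summaries.append([v / seen for v in running])
    return summaries
```

In this sketch each frame is visited once in either mode, in contrast to the pairwise frame comparisons of self-attention. With an unbounded left context, the summary of the last chunk coincides with the offline summary.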

Paper Structure

This paper contains 9 sections, 3 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: The Conformer with non-streaming (left) and streaming (right) SummaryMixing.
  • Figure 2: Word error rate of the MHSA- and SummaryMixing-equipped models trained on VoxPopuli as a function of utterance length (left curve). Sentences are taken from the VoxPopuli test set. The middle curve shows the inference real-time factor observed on CPU and GPU for the two models. The right-most curve gives the peak VRAM of both models when decoding. The left context is infinite for all experiments.