Polynomial Mixing for Efficient Self-supervised Speech Encoders

Eva Feillet; Ryan Whetten; David Picard; Alexandre Allauzen

TL;DR

A novel token-mixing mechanism, the Polynomial Mixer (PoM), is proposed as a drop-in replacement for multi-head self-attention, offering an improved trade-off between performance and efficiency in time and memory.

Abstract

State-of-the-art speech-to-text models typically employ Transformer-based encoders that model token dependencies via self-attention mechanisms. However, the quadratic complexity of self-attention in both memory and computation imposes significant constraints on scalability. In this work, we propose a novel token-mixing mechanism, the Polynomial Mixer (PoM), as a drop-in replacement for multi-head self-attention. PoM computes a polynomial representation of the input with linear complexity with respect to the input sequence length. We integrate PoM into a self-supervised speech representation learning framework based on BEST-RQ and evaluate its performance on downstream speech recognition tasks. Experimental results demonstrate that PoM achieves a competitive word error rate compared to full self-attention and other linear-complexity alternatives, offering an improved trade-off between performance and efficiency in time and memory.

Paper Structure (13 sections, 6 equations, 2 figures, 3 tables)

This paper contains 13 sections, 6 equations, 2 figures, 3 tables.

Introduction
Related work
Self-supervised speech recognition models
Token mixing methods
Method
Polynomial Mixer for speech
Variants of PoM
Results
Experimental setting
Main results
Ablation study
Perspectives
Conclusion

Figures (2)

Figure 1: Principle of the Polynomial Mixer. The input sequence is projected through $k$ polynomial branches, aggregated into a global representation $H(X)$, and combined with a token-wise selector $S$. The output is obtained by projecting the selected state back to the input space.
Figure 2: Inference time and peak memory usage of BEST-RQ models ($\sim95$M params) with various token mixers. Input length is increased from 10 to 80 seconds. As input length grows, MHA requires significantly more time and VRAM than the linear alternatives, including PoM.
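
The data flow described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the `tanh` activation, the element-wise power used for each branch, the mean over tokens used for aggregation, the sigmoid selector, and all names and dimensions (`polynomial_mixer`, `w_branches`, `w_select`, `w_out`) are assumptions chosen for illustration. The sketch only aims to show why the cost grows linearly with sequence length: no token-by-token matrix is ever formed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def polynomial_mixer(x, w_branches, w_select, w_out):
    """Illustrative PoM-style token mixer (assumed form, not the paper's code).

    x:          (n, d) sequence of n tokens
    w_branches: list of k arrays of shape (d, D), one per polynomial branch
    w_select:   (d, k * D) projection for the token-wise selector S
    w_out:      (k * D, d) projection back to the input space
    """
    # Branch p (1-based) raises its activated projection to the power p.
    # Assumption: tanh activation and element-wise powers.
    feats = [np.tanh(x @ w) ** (p + 1) for p, w in enumerate(w_branches)]
    # Aggregate over tokens into one global state H(X) of shape (k * D,).
    # Assumption: a plain mean over the sequence axis.
    h = np.concatenate(feats, axis=-1).mean(axis=0)
    # Token-wise selector S of shape (n, k * D); assumption: sigmoid gate.
    s = sigmoid(x @ w_select)
    # Each token gates the shared state, then projects back to dimension d.
    return (s * h) @ w_out

# Hypothetical sizes: 50 tokens, model width 8, k = 3 branches of width 16.
rng = np.random.default_rng(0)
n, d, D, k = 50, 8, 16, 3
x = rng.normal(size=(n, d))
w_branches = [rng.normal(size=(d, D)) * 0.1 for _ in range(k)]
w_select = rng.normal(size=(d, k * D)) * 0.1
w_out = rng.normal(size=(k * D, d)) * 0.1

out = polynomial_mixer(x, w_branches, w_select, w_out)
print(out.shape)
```

Every operation here is a matrix product or reduction whose cost is proportional to `n`, in contrast to self-attention, which builds an `n` by `n` score matrix. Because the global state in this sketch is a mean over tokens, the mixer is also permutation-equivariant, so any positional information would have to come from elsewhere in the encoder.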
