Table of Contents
Fetching ...

An Analysis of Linear Complexity Attention Substitutes with BEST-RQ

Ryan Whetten, Titouan Parcollet, Adel Moumen, Marco Dinarelli, Yannick Estève

TL;DR

This work studies the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely, HyperMixing, Fastformer, SummaryMixing, and Mamba, and shows that these linear alternatives maintain competitive performance compared to MHSA.

Abstract

Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is computationally and memory expensive. This is in part due the quadratic complexity of multi-head self-attention (MHSA). Alternatives for MHSA have been proposed and used in the speech domain, but have yet to be investigated properly in an SSL setting. In this work, we study the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely, HyperMixing, Fastformer, SummaryMixing, and Mamba. We evaluate these methods by looking at the speed, the amount of VRAM consumed, and the performance on the SSL MP3S benchmark. Results show that these linear alternatives maintain competitive performance compared to MHSA while, on average, decreasing VRAM consumption by around 20% to 60% and increasing speed from 7% to 65% for input sequences ranging from 20 to 80 seconds.

An Analysis of Linear Complexity Attention Substitutes with BEST-RQ

TL;DR

This work studies the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely, HyperMixing, Fastformer, SummaryMixing, and Mamba, and shows that these linear alternatives maintain competitive performance compared to MHSA.

Abstract

Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is computationally and memory expensive. This is in part due the quadratic complexity of multi-head self-attention (MHSA). Alternatives for MHSA have been proposed and used in the speech domain, but have yet to be investigated properly in an SSL setting. In this work, we study the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely, HyperMixing, Fastformer, SummaryMixing, and Mamba. We evaluate these methods by looking at the speed, the amount of VRAM consumed, and the performance on the SSL MP3S benchmark. Results show that these linear alternatives maintain competitive performance compared to MHSA while, on average, decreasing VRAM consumption by around 20% to 60% and increasing speed from 7% to 65% for input sequences ranging from 20 to 80 seconds.
Paper Structure (8 sections, 5 equations, 1 figure, 2 tables)

This paper contains 8 sections, 5 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Inference speed and peak memory of BEST-RQ base and large models with various types of attention. Input length is increased from 10 to 80 seconds. Multi-head self-attention (MHSA) requires significantly more time and VRAM as input size increases where as the alternatives do not.