Linear-Complexity Self-Supervised Learning for Speech Processing

Shucong Zhang, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya

TL;DR

Self-supervised speech models are expensive to pre-train, in part because multi-headed self-attention (MHSA) takes quadratic time and memory in the input length. This work integrates SummaryMixing, a linear-complexity context encoder previously proposed for supervised training, into wav2vec 2.0's Conformer, replacing MHSA with a global summary and a local branch. The resulting model achieves comparable or better downstream performance on ASR, intent classification, emotion recognition, and speaker verification while reducing pre-training time by 18% and peak VRAM by 23%, so that a 155M-parameter model pre-trains within one week on 4 A100 GPUs. The results indicate that a linear-complexity context encoder can match or exceed MHSA in SSL at lower cost, which suggests a path to scaling speech SSL with fewer resources.

Abstract

Self-supervised learning (SSL) models usually require weeks of pre-training with dozens of high-end GPUs. These models typically have a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed. For instance, in supervised training, the SummaryMixing model is the first to outperform MHSA across multiple speech processing tasks. However, these cheaper alternatives have not been explored for SSL yet. This paper studies a linear-complexity context encoder for SSL for the first time. With better or equivalent performance on the downstream tasks of the MP3S benchmark, SummaryMixing reduces the pre-training time and peak VRAM of the wav2vec 2.0 model by 18% and 23%, respectively, so that pre-training a 155M-parameter wav2vec 2.0 model finishes within one week on 4 Tesla A100 GPUs. Code is available at https://github.com/SamsungLabs/SummaryMixing.
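
To make the linear-complexity claim concrete, the following is a minimal NumPy sketch of a summary-plus-local layer in the spirit of SummaryMixing. It is not the authors' implementation (see the repository linked above for that). The function name `summary_mixing`, the single linear projections with ReLU, the concatenation step, and all dimensions are assumptions chosen for illustration. The point is only that each frame is transformed independently and the global context is one mean over time, so the cost grows linearly with the number of frames T rather than with T squared as in MHSA.

```python
import numpy as np


def summary_mixing(x, w_local, w_summary, w_combine):
    """Illustrative summary-plus-local layer (assumed form, not the paper's code).

    x:         (T, d_in) sequence of T frames
    w_local:   (d_in, d_hidden) weights of the per-frame local branch
    w_summary: (d_in, d_hidden) weights of the summary branch
    w_combine: (2 * d_hidden, d_out) weights that merge both branches
    """
    relu = lambda a: np.maximum(a, 0.0)

    # Local branch: every frame is transformed on its own, O(T).
    local = relu(x @ w_local)                       # (T, d_hidden)

    # Summary branch: transform each frame, then average over time.
    # One vector summarises the whole utterance, also O(T).
    summary = relu(x @ w_summary).mean(axis=0)      # (d_hidden,)

    # Give every frame the same global summary next to its local features.
    tiled = np.broadcast_to(summary, local.shape)   # (T, d_hidden)
    combined = np.concatenate([local, tiled], axis=1)

    return relu(combined @ w_combine)               # (T, d_out)


# Hypothetical sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
T, d_in, d_hidden, d_out = 50, 8, 16, 8
x = rng.standard_normal((T, d_in))
w_local = rng.standard_normal((d_in, d_hidden))
w_summary = rng.standard_normal((d_in, d_hidden))
w_combine = rng.standard_normal((2 * d_hidden, d_out))

out = summary_mixing(x, w_local, w_summary, w_combine)
print(out.shape)
```

Unlike MHSA, no T-by-T matrix of pairwise frame interactions is ever formed in this sketch: the only operation that spans the whole sequence is the mean, which is why both time and memory stay linear in T.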

Paper Structure (9 sections, 1 equation, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Architectures of SummaryMixing and the Conformer.
  • Figure 2: The learned weights over the hidden representations of the SummaryMixing (left) and MHSA (right) context encoders for downstream tasks. The weights of each column sum to one. Deeper colors indicate larger weights.