Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models

Adriana Fernandez-Lopez, Shiwei Liu, Lu Yin, Stavros Petridis, Maja Pantic

TL;DR

This paper introduces the Low-Rank Speech Model from Scratch (LR-SMS), an approach that achieves performance parity with full-rank training while substantially reducing parameter count and speeding up training.

Abstract

This paper investigates the under-explored area of low-rank weight training from scratch for large-scale Conformer-based speech recognition models. Our study demonstrates the viability of this training paradigm for such models and yields several notable findings. First, we discover that applying a low-rank structure exclusively to the attention modules can unexpectedly enhance performance, even with a significant rank reduction of 12%. In contrast, feed-forward layers present greater challenges, as they begin to exhibit performance degradation at a moderate 50% rank reduction. Furthermore, we find that both initialization and layer-wise rank assignment play critical roles in successful low-rank training. Specifically, employing SVD initialization and linear layer-wise rank mapping significantly boosts the efficacy of low-rank weight training. Building on these insights, we introduce the Low-Rank Speech Model from Scratch (LR-SMS), an approach that achieves performance parity with full-rank training while delivering substantial reductions in parameter count (by at least 2x) and training-time speedups (by 1.3x for ASR and 1.15x for AVSR).
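To make the idea of a factorized linear layer with SVD initialization concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the function names `svd_init_factors` and `factorized_param_count`, the layer sizes, the chosen rank, and the Kaiming-style starting matrix are all assumptions made for illustration. It assumes a weight matrix of shape `(out_dim, in_dim)` is replaced by two factors whose product is its truncated SVD.

```python
import numpy as np


def svd_init_factors(weight, rank):
    """Split `weight` (out_dim x in_dim) into A (out_dim x rank) and
    B (rank x in_dim) so that A @ B is the truncated SVD of `weight`.

    Hypothetical helper for illustration; the singular values are split
    evenly (square root) between the two factors, which is one common choice.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    a = u[:, :rank] * sqrt_s           # scale the kept left singular vectors
    b = sqrt_s[:, None] * vt[:rank]    # scale the kept right singular vectors
    return a, b


def factorized_param_count(out_dim, in_dim, rank):
    """Number of weights in the two factors A and B."""
    return rank * (out_dim + in_dim)


rng = np.random.default_rng(0)
out_dim, in_dim, rank = 256, 1024, 64  # assumed sizes, not from the paper

# Assumed starting point: a Kaiming-style normal initialization.
full = rng.standard_normal((out_dim, in_dim)) * np.sqrt(2.0 / in_dim)
a, b = svd_init_factors(full, rank)

# Forward pass of the factorized layer: two thin matmuls instead of one wide one.
x = rng.standard_normal((8, in_dim))
y_low = (x @ b.T) @ a.T

full_params = out_dim * in_dim
low_params = factorized_param_count(out_dim, in_dim, rank)
print(f"full-rank weights: {full_params}, factorized weights: {low_params}")
```

With these assumed sizes the factorized layer stores `rank * (out_dim + in_dim)` weights instead of `out_dim * in_dim`, so the saving depends on how far the rank sits below the smaller of the two dimensions.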

Paper Structure

This paper contains 10 sections, 2 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Close-up view of the topology of a block of the ASR Conformer Encoder at different training stages. Ratio: number of singular vectors required to approximate 95% of each weight matrix, relative to the total number of singular values. A large ratio indicates that most singular vectors are necessary, while a small ratio means few are needed.
  • Figure 2: (a) End-to-end ASR architecture. (b) Feed-Forward Network. (c) Factorized Linear Layer. (d) Multi-Headed Self-Attention.
  • Figure 3: (a) Conformer encoder blocks. (b) Transformer decoder blocks. Ratio: number of singular vectors required to approximate 95% of each weight matrix, relative to the total number of singular values. The X-axis represents the model depth from the first block to the last. $b$ is the current block and $B$ the number of blocks. A ratio close to 100% indicates that all singular vectors are necessary, while a ratio close to 0% means very few are needed.
  • Figure 4: WER [%] ($\downarrow$) of low-rank ASR models trained from scratch on LRS3. The models are initialized with either SVD or Kaiming initialization [hernandez2023sharing, winata2020lightweight]. We compare their performance when different uniform scaling factors $\alpha$ are applied to FFNs.
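
The ratio reported in Figures 1 and 3 can be sketched as follows. This is an illustrative reading rather than the authors' code: the captions do not say how "approximate 95%" is measured, so the sketch assumes it means retaining 95% of the squared singular-value mass (Frobenius energy). The function name `singular_ratio` and the `threshold` parameter are hypothetical.

```python
import numpy as np


def singular_ratio(weight, threshold=0.95):
    """Fraction of singular values needed to retain `threshold` of the
    squared singular-value mass of `weight`.

    Assumption: "approximating 95% of a weight matrix" is interpreted as
    keeping 95% of the Frobenius energy; other criteria are possible.
    """
    s = np.linalg.svd(weight, compute_uv=False)  # sorted in descending order
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First index whose cumulative energy reaches the threshold, as a count.
    k = int(np.searchsorted(energy, threshold)) + 1
    return min(k, len(s)) / len(s)


# A rank-1 matrix needs a single singular vector out of four.
rank_one = np.outer(np.ones(4), np.ones(8))
ratio_rank_one = singular_ratio(rank_one)

# An identity matrix has a flat spectrum, so every singular vector is needed.
ratio_identity = singular_ratio(np.eye(10))
```

Under this reading, a matrix with a rapidly decaying spectrum yields a small ratio and is a natural candidate for a lower rank, while a flat spectrum yields a ratio near 100%.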