MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization

Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Lu Yin, Qiao Xiao, Stavros Petridis, Shiwei Liu, Maja Pantic

TL;DR

MSRS tackles the challenge of training visual and audio-visual speech recognition models from scratch by introducing a differentiable sparse mask that rapidly identifies a pruning topology to stabilize gradient flow. The mask is learned jointly with model parameters using a differentiable relaxation $m_l = \sigma(l \phi)$ and a two-temperature optimization strategy, with the loss $\mathcal{L}(m_l,\theta) = \mathcal{L}_{att} + \gamma \mathcal{L}_{ctc} + \lambda {\mathbf{1}}_p^{\top} \phi$; after a few epochs, a near-binary mask $m_*$ is applied, allowing either a transition back to dense training or continued sparse updates. Empirically, MSRS yields competitive end-to-end VSR results on LRS3 (e.g., 21.1% WER) and AVSR results (e.g., 0.9% WER), with at least 2x training time reduction and robust performance in low-data and noisy conditions, outperforming several sparse-training baselines. The approach enables training large multimodal speech models without pretraining or large-scale data and complements techniques like LayerScale to further improve gradient flow, while supporting low-precision computation.
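
To make the masking mechanism described above concrete, here is a minimal sketch of how a jointly learned soft mask and sparsity-regularized loss could be implemented in PyTorch. All names (MaskedLinear, msrs_style_loss), the sigmoid-with-temperature parameterization, and the default values of gamma, lam, and tau are illustrative assumptions, not the paper's exact formulation or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    """Linear layer whose weights are gated by a learnable soft mask.

    Hypothetical sketch of MSRS-style mask optimization: mask logits `phi`
    are trained jointly with the weights, and the temperature `tau` controls
    how close the mask sigmoid(phi / tau) is to binary.
    """

    def __init__(self, in_features, out_features, tau=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.xavier_uniform_(self.weight)
        # Mask logits start positive so the initial mask is close to all-ones (dense).
        self.phi = nn.Parameter(torch.full_like(self.weight, 2.0))
        self.tau = tau

    def mask(self):
        return torch.sigmoid(self.phi / self.tau)

    def forward(self, x):
        return F.linear(x, self.weight * self.mask(), self.bias)


def msrs_style_loss(att_loss, ctc_loss, masked_layers, gamma=0.1, lam=1e-4):
    """Hybrid attention/CTC objective plus an L1-style penalty on the soft masks."""
    sparsity = sum(layer.mask().sum() for layer in masked_layers)
    return att_loss + gamma * ctc_loss + lam * sparsity
```

In this reading, the sparsity penalty pushes many mask entries toward zero early in training, which is the mechanism the TL;DR credits with stabilizing gradient flow.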

Abstract

Pre-trained models have been a foundational approach in speech recognition, albeit with associated additional costs. In this study, we propose a regularization technique that facilitates the training of visual and audio-visual speech recognition models (VSR and AVSR) from scratch. This approach, abbreviated as MSRS (Multimodal Speech Recognition from Scratch), introduces a sparse regularization that rapidly learns sparse structures within the dense model at the very beginning of training, which receive healthier gradient flow than the dense equivalent. Once the sparse mask stabilizes, our method allows transitioning to a dense model or keeping a sparse model by updating non-zero values. MSRS achieves competitive results in VSR and AVSR with 21.1% and 0.9% WER on the LRS3 benchmark, while reducing training time by at least 2x. We explore other sparse approaches and show that only MSRS enables training from scratch by implicitly masking the weights affected by vanishing gradients.
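
The transition the abstract mentions, from jointly learned masks to either dense or sparse continuation, could look roughly like the sketch below, which builds on the hypothetical MaskedLinear layer above. The soft mask is thresholded into a near-binary mask m_*, pruned weights are zeroed once, and training then proceeds either sparse (gradients of pruned weights are masked out) or dense (all weights resume receiving gradients). The threshold value and the gradient-hook mechanics are assumptions, not the paper's exact procedure.

```python
import torch


def freeze_masks(masked_layers, threshold=0.5, keep_sparse=True):
    """Freeze near-binary masks m_* once the soft masks stabilize.

    Hypothetical sketch: after a few epochs of joint mask/weight training,
    stop updating the mask logits and either keep the sparse topology
    or resume dense training.
    """
    for layer in masked_layers:
        with torch.no_grad():
            m_star = (layer.mask() > threshold).float()
            layer.weight.mul_(m_star)      # fold the binary mask into the weights
            layer.phi.fill_(1e3)           # forward mask becomes ~1 everywhere
        layer.phi.requires_grad_(False)    # the mask topology is now fixed
        if keep_sparse:
            # Sparse continuation: gradients of pruned weights are zeroed,
            # so the topology selected by m_* stays fixed.
            layer.weight.register_hook(lambda grad, m=m_star: grad * m)
        # Dense continuation (keep_sparse=False): all weights receive gradients
        # again, so pruned connections may regrow.
```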

Paper Structure

This paper contains 13 sections, 6 equations, 3 figures, 4 tables, and 1 algorithm.

Figures (3)

  • Figure 1: A close-up view of half of the VSR training process: L2 gradient norm of the first feed-forward layer in the encoder for multiple models trained with various regularization strategies. Uniform initialization is assumed unless pre-training is specified.
  • Figure 2: Impact of $\lambda$ on VSR training.
  • Figure 3: Sparsity distribution in the frontend, an encoder block, and a decoder block using various approaches: MSRS, ERK (for SET, RigL, and CHASE), and GMP. For the encoder and decoder, the per-block average is presented. Non-prunable layers such as normalization layers are excluded.