Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Yash Jain, David Chan, Pranav Dheram, Aparna Khare, Olabanji Shonibare, Venkatesh Ravichandran, Shalini Ghosh

TL;DR

The paper aims to improve automatic speech recognition through a multi-stage, multi-modal pre-training framework that combines masked autoencoding (MAE) and contrastive learning (CLR) with a translation-based mid-training step. It leverages audio-visual data from diverse datasets and introduces a mid-training stage on MuST-C to align speech representations with text, yielding substantial relative improvements in word error rate (WER) on Librispeech and across SUPERB tasks. Key findings show that MAE generally outperforms CLR for ASR, that translation-based mid-training provides strong gains (notably with Italian as a complementary language), and that data composition critically shapes outcomes. The work offers practical guidance on pre-training strategies, dataset selection, and the value of translation-driven mid-training for enhancing multi-modal ASR systems.

Abstract

Recent advances in machine learning have demonstrated that multi-modal pre-training can improve automatic speech recognition (ASR) performance compared to randomly initialized models, even when models are fine-tuned on uni-modal tasks. Existing multi-modal pre-training methods for the ASR task have primarily focused on single-stage pre-training where a single unsupervised task is used for pre-training followed by fine-tuning on the downstream task. In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. We empirically demonstrate that such a multi-stage approach leads to relative word error rate (WER) improvements of up to 38.45% over baselines on both Librispeech and SUPERB. Additionally, we share several important findings for choosing pre-training methods and datasets.
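
To make the headline metric concrete, the following is a minimal sketch of how WER and a relative WER improvement are commonly computed. The function names are hypothetical and the whitespace tokenization is an assumption; the paper's own scoring pipeline may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # (before the current word) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, model_wer: float) -> float:
    """Percentage reduction in WER relative to the baseline (higher is better)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - model_wer) / baseline_wer
```

Under this definition, a model that lowers WER from 10.0 to 8.0 has a relative improvement of 20%, regardless of the absolute scale of the baseline.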

Paper Structure (11 sections, 4 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Overview of the multi-modal training strategy. Raw audio and video features are extracted from the source data and passed through the audio and video encoders. The encoded features are then used in three ways: (1) MAE: the masked encoded features are reconstructed successively through a common decoder and compared against the original input using an L2 loss, (2) CLR: contrastive learning is applied to spatio-temporally pooled audio and video encoded features, and (3) the trained audio encoder is further used for mid-training (translation task) and then for downstream tasks.
  • Figure 2: Aggregate (dataset/language) relative performance improvement (higher is better) under mid-training for MAE + CLR on SUPERB. KS: Keyword Spotting, IC: Intent Classification, PR: Phoneme Recognition, SD: Speaker Diarization. Translation mid-training consistently improves performance on tasks that require local feature information (KS, IC, and PR), whereas the global task SD shows a decrease in performance. This indicates that the translation mid-training task enhances the pre-trained model's performance on local-feature tasks while hurting the global-feature task.
  • Figure 3: Average relative WER improvement on the Librispeech test-clean and test-other datasets with mid-training, showing the effect of pre-training methods (left), mid-training translation pairs (center), and pre-training datasets (right). Translation mid-training improves upon CLR pre-training the most, as it aligns CLR features with the local information required for ASR. Among the translation languages, Italian provides the best improvement, suggesting that a language complementary to English yields larger gains than languages that share roots with English. Models pre-trained on the non-speech dataset Kinetics benefit the most from translation mid-training, followed by the noisy speech dataset Voxceleb2 and then the clean speech dataset LRS3.
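
The two pre-training objectives named in the Figure 1 caption can be sketched as follows. This is an illustrative approximation, not the authors' implementation: the caption specifies only an L2 reconstruction loss and contrastive learning on pooled features, so restricting the reconstruction error to masked positions, the symmetric InfoNCE form, the temperature value, and all function names are assumptions.

```python
import math


def l2_reconstruction_loss(original, reconstructed, mask):
    """Mean squared error over masked positions (mask[i] is True where input was hidden).

    Scoring only the masked positions is an assumption of this sketch.
    """
    errors = [(o - r) ** 2 for o, r, m in zip(original, reconstructed, mask) if m]
    if not errors:
        raise ValueError("mask must select at least one position")
    return sum(errors) / len(errors)


def _normalize(vector):
    """Scale a vector to unit L2 norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _mean_cross_entropy(logits):
    """Average cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_partition = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def contrastive_loss(audio, video, temperature=0.1):
    """Symmetric InfoNCE over pooled embeddings; audio[i] and video[i] form a positive pair.

    The InfoNCE form and the temperature default are assumptions of this sketch.
    """
    a = [_normalize(v) for v in audio]
    b = [_normalize(v) for v in video]
    # Cosine similarity between every audio and every video embedding.
    logits = [
        [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        for a_i in a
    ]
    transposed = [list(column) for column in zip(*logits)]
    # Average the audio-to-video and video-to-audio directions.
    return 0.5 * (_mean_cross_entropy(logits) + _mean_cross_entropy(transposed))
```

In this sketch, a batch whose audio and video embeddings are correctly paired yields a lower contrastive loss than the same batch with the video rows shuffled, which is the property the CLR objective optimizes for.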