mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition

Andrew Rouditchenko, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

TL;DR

mWhisper-Flamingo extends Whisper-Flamingo to multilingual audio-visual speech recognition (AVSR) by pairing Whisper's multilingual audio encoder and decoder with a multilingual AV-HuBERT visual encoder and introducing decoder modality dropout. The model is trained in two stages and fuses the modalities late, through cross-attention in the decoder; the dropout alternates between audio-visual, audio-only, and video-only inputs to improve cross-modal integration and robustness. On MuAViC, it achieves state-of-the-art multilingual AVSR performance and consistently outperforms audio-only Whisper in noisy conditions, with the medium model delivering the best multilingual WER of $43.7\%$ under 0-dB babble noise. Ablations show that decoder modality dropout and a fine-tunable visual encoder are essential for maximizing multilingual gains, and that multilingual AV-HuBERT outperforms English-only variants. The work demonstrates that leveraging multilingual audio-visual data and robust modality integration significantly improves noise-robust multilingual speech recognition and provides a practical, scalable approach to multilingual AVSR.

Abstract

Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.

Paper Structure

This paper contains 10 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: In mWhisper-Flamingo, the AV-HuBERT and Whisper encoders extract visual and audio features from multilingual videos. Separate cross-attention layers in Whisper's decoder attend to the visual and audio features. Decoder modality dropout randomly replaces the audio or video features with zeros, forcing the decoder to train on video-only and audio-only inputs.
  • Figure 2: Multilingual WER ($\downarrow$ is better) for different noise types averaged over 4 languages (Es, Fr, It, Pt) and 5 SNR levels $\{-10,-5,0,5,10 \}$.
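
The decoder modality dropout described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the list-of-frames feature layout, and the sampling probabilities `p_av` and `p_audio_only` are all assumptions made for illustration, since the text above does not state the actual values. The sketch only shows the idea of randomly zeroing one modality's encoder features before they reach the decoder.

```python
import random


def zeros_like(feats):
    """Return a zero-filled copy with the same (frames x dims) shape."""
    return [[0.0] * len(frame) for frame in feats]


def decoder_modality_dropout(audio_feats, video_feats,
                             p_av=0.5, p_audio_only=0.25, rng=random):
    """Pick a training mode and zero out the dropped modality.

    p_av and p_audio_only are hypothetical placeholders, not values
    from the paper; the remaining probability mass goes to video-only.
    Returns (audio_feats, video_feats, mode).
    """
    r = rng.random()
    if r < p_av:
        # Paired audio-visual input: keep both feature streams.
        return audio_feats, video_feats, "av"
    if r < p_av + p_audio_only:
        # Audio-only: the video features are replaced with zeros.
        return audio_feats, zeros_like(video_feats), "audio_only"
    # Video-only: the audio features are replaced with zeros.
    return zeros_like(audio_feats), video_feats, "video_only"


if __name__ == "__main__":
    # Toy features: 2 audio frames of size 2, 1 video frame of size 3.
    audio = [[1.0, 2.0], [3.0, 4.0]]
    video = [[5.0, 6.0, 7.0]]
    a, v, mode = decoder_modality_dropout(audio, video)
    print(mode, a, v)
```

In a real training loop the same decision would be applied to encoder output tensors for each batch, so that the decoder's cross-attention layers see audio-visual, audio-only, and video-only inputs over the course of training.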