mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
Andrew Rouditchenko, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
TL;DR
mWhisper-Flamingo extends Whisper-Flamingo to multilingual audio-visual speech recognition (AVSR) by pairing Whisper's multilingual audio encoder and decoder with a multilingual AV-HuBERT visual encoder and by introducing decoder modality dropout. The model is trained in two stages and fuses the modalities late, via cross-attention. Decoder modality dropout alternates between audio-visual, audio-only, and video-only inputs during training to improve cross-modal integration and robustness. On MuAViC, the method achieves state-of-the-art multilingual AVSR performance and consistently outperforms audio-only Whisper in noisy conditions; the medium model delivers the best multilingual WER, $43.7\%$ under 0-dB babble noise. Ablations show that decoder modality dropout and a fine-tunable visual encoder are both essential for maximizing multilingual gains, and that multilingual AV-HuBERT outperforms English-only variants. Overall, the work shows that combining multilingual audio-visual data with robust modality integration significantly improves noise-robust multilingual speech recognition.
Abstract
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR, which combines the strengths of a pre-trained audio model (Whisper) and a pre-trained video model (AV-HuBERT). To enable better multi-modal integration and improve noisy multilingual performance, we introduce decoder modality dropout, where the model is trained both on paired audio-visual inputs and on separate audio-only and visual-only inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
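The idea behind decoder modality dropout can be illustrated with a minimal Python sketch: for each training step, one of three input configurations (audio-visual, audio-only, video-only) is sampled, and the unused modality is withheld from the decoder. This is only an illustration under stated assumptions, not the authors' implementation. The names `sample_mode` and `apply_modality_dropout`, the mode labels, and the default probabilities are all hypothetical; the text above does not give the sampling probabilities.

```python
import random
from typing import Any, Optional, Sequence, Tuple

# Hypothetical labels for the three training configurations.
MODES = ("av", "audio_only", "video_only")


def sample_mode(
    rng: random.Random,
    probs: Sequence[float] = (0.5, 0.25, 0.25),  # assumed values, not from the paper
) -> str:
    """Sample which modalities the decoder sees for one training step."""
    if len(probs) != len(MODES) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must have one entry per mode and sum to 1")
    return rng.choices(MODES, weights=probs, k=1)[0]


def apply_modality_dropout(
    audio_feats: Any, video_feats: Any, mode: str
) -> Tuple[Optional[Any], Optional[Any]]:
    """Return (audio, video) with the dropped modality replaced by None."""
    if mode == "av":
        return audio_feats, video_feats
    if mode == "audio_only":
        return audio_feats, None
    if mode == "video_only":
        return None, video_feats
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    rng = random.Random(0)
    for step in range(5):
        mode = sample_mode(rng)
        audio, video = apply_modality_dropout("audio_feats", "video_feats", mode)
        print(step, mode, audio, video)
```

In a real training loop, the `None` placeholder would correspond to whatever mechanism the model uses to withhold a modality from the decoder (for example, skipping the corresponding attention path), which this sketch deliberately leaves unspecified.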
