Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
Yihan Wu, Yifan Peng, Yichen Lu, Xuankai Chang, Ruihua Song, Shinji Watanabe
TL;DR
EVA tackles robustness in audiovisual speech recognition (AVSR) for in-the-wild videos by fusing unconstrained full-frame visuals into a strong pretrained speech recognition backbone via a multimodal mixture-of-experts (MoE). It builds on OWSM v3.1 for speech understanding, uses CLIP to extract visual tokens, and employs a sparse MoE router to integrate the audio and visual streams without degrading speech performance. The training objective combines attention and CTC losses with a load-balancing auxiliary term that encourages balanced expert usage. EVA achieves state-of-the-art results on How2, VisSpeech, and Ego4D with a relatively modest amount of audiovisual data. The approach demonstrates strong cross-domain generalization and offers a path toward parameter-efficient, robust AVSR in diverse video domains.
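As a rough illustration of the objective described above, the three terms can be written as a weighted sum. The interpolation form and the symbols $\lambda$ (CTC weight) and $\beta$ (auxiliary-loss weight) are assumptions made here for illustration; the summary above only states that the three terms are combined, not how they are weighted.

```latex
\mathcal{L}
  = \lambda \, \mathcal{L}_{\mathrm{CTC}}
  + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}}
  + \beta \, \mathcal{L}_{\mathrm{aux}},
\qquad 0 \le \lambda \le 1, \quad \beta \ge 0
```

Here $\mathcal{L}_{\mathrm{aux}}$ stands for the load-balancing term, which penalizes routing that concentrates tokens on a few experts.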
Abstract
Visual signals can enhance audiovisual speech recognition accuracy by providing additional contextual information. Because visual signals are complex, an audiovisual speech recognition model must generalize robustly across diverse video scenarios, which remains a significant challenge. In this paper, we introduce EVA, which leverages the mixture-of-Experts for audioVisual ASR to perform robust speech recognition for "in-the-wild" videos. Specifically, we first encode visual information into a sequence of visual tokens and map it into the speech space with a lightweight projection. Then, we build EVA upon a robust pretrained speech recognition model to retain its generalization ability. Moreover, to incorporate visual information effectively, we inject it into the ASR model through a mixture-of-experts module. Experiments show that our model achieves state-of-the-art results on three benchmarks, which demonstrates the generalization ability of EVA across diverse video domains.
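The pipeline in the abstract (project visual tokens into the speech space, then route tokens through a sparse mixture-of-experts) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not EVA's implementation: the function names, the tanh experts, top-2 routing, the concatenation of visual and audio tokens, and the Switch-Transformer-style balance loss are all illustrative choices that the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def project_visual(visual_tokens, w_proj):
    """Lightweight linear projection from visual dim to speech model dim.

    visual_tokens: (num_visual_tokens, d_visual); w_proj: (d_visual, d_model).
    """
    return visual_tokens @ w_proj


def sparse_moe(tokens, w_router, expert_weights, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    tokens: (T, d); w_router: (d, E); expert_weights: (E, d, d).
    Returns (output of shape (T, d), scalar load-balancing loss).
    """
    num_tokens, num_experts = tokens.shape[0], w_router.shape[1]
    probs = softmax(tokens @ w_router)                    # (T, E) router probabilities
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]      # (T, k) chosen experts
    top_p = np.take_along_axis(probs, top_idx, axis=-1)   # (T, k) their probabilities
    gates = top_p / top_p.sum(axis=-1, keepdims=True)     # renormalize over chosen experts

    out = np.zeros_like(tokens)
    for slot in range(top_k):
        for e in range(num_experts):
            mask = top_idx[:, slot] == e
            if mask.any():
                # Hypothetical expert: a single tanh layer (assumption).
                expert_out = np.tanh(tokens[mask] @ expert_weights[e])
                out[mask] += gates[mask, slot:slot + 1] * expert_out

    # Assumed balance loss: E * sum_i(fraction routed to i * mean router prob of i).
    frac = np.bincount(top_idx[:, 0], minlength=num_experts) / num_tokens
    aux_loss = num_experts * float(np.sum(frac * probs.mean(axis=0)))
    return out, aux_loss


# Toy usage with random data; all dimensions are arbitrary.
rng = np.random.default_rng(0)
audio = rng.normal(size=(6, 8))                           # 6 speech frames, d_model = 8
visual = project_visual(rng.normal(size=(3, 16)),         # 3 visual tokens, d_visual = 16
                        rng.normal(size=(16, 8)) * 0.1)
fused_input = np.concatenate([visual, audio], axis=0)     # concatenation is an assumption
fused, aux = sparse_moe(fused_input,
                        rng.normal(size=(8, 4)),
                        rng.normal(size=(4, 8, 8)) * 0.1)
```

In this sketch the auxiliary loss grows when the router favors a few experts, which is the behavior a load-balancing term is meant to discourage.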
