MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition
Sungnyun Kim, Kangwook Jang, Sangmin Bae, Sungwoo Cho, Se-Young Yun
TL;DR
MoHAVE is a sparse, modality-aware Mixture of Hierarchical Audio-Visual Experts that addresses the scalability of audio-visual speech recognition (AVSR). It uses a two-layer hierarchical gating mechanism: an inter-modal router allocates computation between audio and visual expert groups based on input context, and intra-modal routing selects experts within each group. A group-load biasing loss guides this allocation. The approach yields state-of-the-art robustness on noisy AVSR benchmarks (e.g., LRS3 with various noise types and SNRs) and strong multilingual performance on MuAViC, while keeping computational cost comparable to that of smaller AVSR models. The framework shows that AVSR can scale while still exploiting both modalities under adverse conditions, which matters for real-world noisy and multilingual speech recognition.
Abstract
Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.
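To make the hierarchical gating idea concrete, the following is a minimal sketch of two-level routing for a single token, written in plain Python. It is one plausible reading of the mechanism summarized above, not the authors' implementation: the function names (`softmax`, `linear`, `hierarchical_route`), the group indexing (0 for audio, 1 for visual), the top-k selection within each group, and the way group and expert probabilities are multiplied are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Matrix-vector product: `weights` is a list of rows, `x` is a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def hierarchical_route(x, inter_w, intra_w, top_k=1):
    """Two-level routing for one token (illustrative assumption).

    inter_w: router weights over expert groups (hypothetical: 0=audio, 1=visual).
    intra_w: one router weight matrix per group, over that group's experts.
    Returns a list of (group, expert, weight) for the selected experts.
    """
    # Level 1: the inter-modal router weights each expert group.
    group_probs = softmax(linear(inter_w, x))
    selected = []
    for g, p_group in enumerate(group_probs):
        # Level 2: the intra-modal router scores experts inside group g.
        expert_probs = softmax(linear(intra_w[g], x))
        # Keep only the top-k experts in the group (sparse activation).
        order = sorted(range(len(expert_probs)),
                       key=lambda e: expert_probs[e], reverse=True)
        top = order[:top_k]
        norm = sum(expert_probs[e] for e in top)
        for e in top:
            # Final weight = group weight * renormalized expert weight.
            selected.append((g, e, p_group * expert_probs[e] / norm))
    return selected


# Toy usage with random router weights (all sizes are arbitrary).
random.seed(0)
dim, experts_per_group, num_groups = 4, 3, 2
x = [random.gauss(0, 1) for _ in range(dim)]
inter_w = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_groups)]
intra_w = [[[random.gauss(0, 1) for _ in range(dim)]
            for _ in range(experts_per_group)]
           for _ in range(num_groups)]

for group, expert, weight in hierarchical_route(x, inter_w, intra_w, top_k=1):
    print(f"group={group} expert={expert} weight={weight:.3f}")
```

In this sketch the weights of the selected experts sum to one, so the layer output would be a convex combination of the chosen experts' outputs. When the inter-modal router assigns most of its probability to one group, that group's experts dominate the output, which is the kind of input-dependent shift between audio and visual experts that the abstract describes.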
