MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition

Sungnyun Kim, Kangwook Jang, Sangmin Bae, Sungwoo Cho, Se-Young Yun

TL;DR

MoHAVE introduces a sparse, modality-aware Mixture of Hierarchical Audio-Visual Experts to address scalability in AVSR. By using a two-layer hierarchical gating mechanism with an inter-modal router and intra-modal routing, MoHAVE dynamically allocates computation to audio or visual expert groups based on input context, guided by a group-load biasing loss. The approach yields state-of-the-art robustness on noisy AVSR benchmarks (e.g., LRS3 with various noise types and SNRs) and strong multilingual performance on MuAViC, while keeping computational costs comparable to smaller AVSR models. This framework demonstrates scalable AVSR that effectively leverages both modalities under adverse conditions, with practical impact for real-world, noisy, and multilingual speech recognition tasks.
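
The two-level gating described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name `HierarchicalRouter`, the random linear "experts", and the choice to weight each group's output only by the inter-modal probability are all hypothetical. The top-1 selection within each group follows the Figure 3(c) caption; the group-load biasing loss is not modeled here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class HierarchicalRouter:
    """Hypothetical sketch: inter-modal router over two expert groups,
    then one intra-modal router per group that picks a single expert."""

    GROUPS = ("audio", "visual")

    def __init__(self, dim, experts_per_group, seed=0):
        rng = random.Random(seed)
        # Inter-modal router: one logit per group (audio, visual).
        self.inter = random_matrix(2, dim, rng)
        # Intra-modal routers: one logit per expert inside each group.
        self.intra = {g: random_matrix(experts_per_group, dim, rng)
                      for g in self.GROUPS}
        # Assumption: each expert is a plain linear map, for brevity only.
        self.experts = {g: [random_matrix(dim, dim, rng)
                            for _ in range(experts_per_group)]
                        for g in self.GROUPS}

    def route(self, x):
        # Level 1: soft weights over the two modality-specific groups.
        group_probs = softmax(matvec(self.inter, x))
        out = [0.0] * len(x)
        chosen = {}
        for group, weight in zip(self.GROUPS, group_probs):
            # Level 2: top-1 expert within the group.
            probs = softmax(matvec(self.intra[group], x))
            idx = max(range(len(probs)), key=probs.__getitem__)
            chosen[group] = idx
            y = matvec(self.experts[group][idx], x)
            # Assumption: the group output is scaled by the group weight only.
            out = [o + weight * v for o, v in zip(out, y)]
        return out, tuple(group_probs), chosen


router = HierarchicalRouter(dim=4, experts_per_group=3, seed=0)
out, (p_audio, p_visual), chosen = router.route([0.5, -1.0, 0.25, 2.0])
print(p_audio, p_visual, chosen)
```

Only two experts are evaluated per token in this sketch (one per group), so the per-token cost does not depend on how many experts each group holds.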

Abstract

Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.

Paper Structure

This paper contains 40 sections, 14 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Comparison of AVSR models based on standard Transformers (AV-HuBERT; Shi et al., 2022), MoE, and MoHAVE, evaluated under babble noise. The MoE structure boosts the model capacity while maintaining the number of activations. MoHAVE-Base (359M) achieves similar performance to AV-HuBERT-Large (477M) while activating only 189M parameters.
  • Figure 2: Overview of sparsely-gated MoE for AVSR. Only a selected subset of experts is activated for each token representation ($x_t$).
  • Figure 3: MoE-based routing strategies for AVSR. (a) A conventional MoE approach where a learned router selects the top-2 experts for each token, enforcing the balanced expert load. (b) Experts are explicitly divided into audio and visual groups, with manual activation based on the input modality. (c) MoHAVE introduces an inter-modal router that can dynamically assign weights to modality-specific expert groups, followed by intra-modal routers that select the top-1 expert within each group. The inter-modal router is trained by the load biasing loss that guides the expert group specialization.
  • Figure 4: (a) Expert load distribution in MoHAVE according to input modalities, with expert selection frequencies weighted by the inter-modal router’s output probability. (b) Performance of the hard routing strategy under different weight assignments to audio expert group. The visual expert group is weighted by $p^V = 1-p^A$.
  • Figure 5: Expert load distribution in MoHAVE for the audio group (solid bars) and visual group (dashed bars) across noisy audio-visual sequences under babble (left) and natural (right) noise. Full layer-wise results are provided in the appendix on expert group usage.
  • ...and 2 more figures
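
For contrast with the hierarchical scheme, the conventional top-2 routing mentioned in the Figure 2 and Figure 3(a) captions can be sketched as follows. The function names `top_k_gate` and `moe_layer` are hypothetical, the toy experts simply scale their input, and renormalizing the gate weights with a softmax over the selected logits is one common variant that is assumed here rather than taken from the paper. The load-balancing term is omitted.

```python
import math


def top_k_gate(logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, router_logits, experts, k=2):
    """Evaluate only the k selected experts and mix their outputs."""
    gates = top_k_gate(router_logits, k)
    out = [0.0] * len(x)
    for i, weight in gates.items():
        y = experts[i](x)  # experts outside the top-k are never called
        out = [o + weight * v for o, v in zip(out, y)]
    return out, gates


# Toy experts: expert j multiplies its input by a fixed scale.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
out, gates = moe_layer([1.0, -2.0], [0.1, 2.0, -1.0, 2.0], experts)
print(gates)  # → {1: 0.5, 3: 0.5}
print(out)    # → [3.0, -6.0]
```

Adding more experts to the list increases total capacity, but the number of experts evaluated per token stays at `k`, which is the property the Figure 1 caption refers to when it contrasts total and activated parameters.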