MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition

Umberto Cappellazzo, Minsu Kim, Pingchuan Ma, Honglie Chen, Xubo Liu, Stavros Petridis, Maja Pantic

TL;DR

MoME tackles the computational and flexibility challenges of large-language-model–based AVSR by fusing Matryoshka representations with sparse Mixture-of-Experts. It generates multi-scale audio-visual tokens and trains a shared-router MoME module that jointly learns across scales, enabling efficient inference without retraining for each compression rate. The approach achieves state-of-the-art results on LRS2 and LRS3 while using far fewer active parameters and showing robustness to noise, supported by ablations and visualizations that reveal cross-scale, cross-modal alignment. These findings offer a scalable, interpretable solution for resource-aware speech recognition and potential extension to other multimodal tasks.

Abstract

Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a compression rate in advance and produce a single fixed-length output, offering no flexibility to balance information density and efficiency at inference time. Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities, allowing compression rates to be adjusted dynamically. However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen LLM with top-k routed and shared experts, allowing dynamic capacity allocation across scales and modalities. A shared router promotes consistent expert activation across granularities, enabling compressed sequences to benefit from representations learned at lower compression. Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise. MoME unifies the adaptability of MRL with the efficiency of MoE, offering a scalable and interpretable solution for resource-aware speech recognition.
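
The routing idea in the abstract (one shared router, top-k routed experts, and always-on shared experts, reused for every token granularity) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name `MoMESketch`, the ReLU bottleneck experts, the softmax over the selected router logits, and all sizes are assumptions chosen only to keep the example short.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MoMESketch:
    """Toy MoME-style module (hypothetical names and sizes throughout)."""

    def __init__(self, d_model=16, d_hidden=4, n_routed=4, n_shared=1,
                 top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # A single router, reused for every scale (the "shared router").
        self.router = rng.normal(0.0, 0.02, (d_model, n_routed))

        def make_expert():
            # Assumption: each expert is a small down/up bottleneck.
            return (rng.normal(0.0, 0.02, (d_model, d_hidden)),
                    rng.normal(0.0, 0.02, (d_hidden, d_model)))

        self.routed = [make_expert() for _ in range(n_routed)]
        self.shared = [make_expert() for _ in range(n_shared)]

    @staticmethod
    def _expert(x, weights):
        down, up = weights
        return np.maximum(x @ down, 0.0) @ up  # ReLU is an assumption

    def __call__(self, tokens):
        # tokens: (seq_len, d_model); seq_len depends on the compression rate.
        logits = tokens @ self.router                      # (T, n_routed)
        top_idx = np.argsort(-logits, axis=-1)[:, :self.top_k]
        top_logits = np.take_along_axis(logits, top_idx, axis=-1)
        gates = softmax(top_logits, axis=-1)               # (T, top_k)

        out = np.zeros_like(tokens)
        for e, weights in enumerate(self.routed):
            # Gate weight of expert e per token (0 where it was not selected).
            w = (gates * (top_idx == e)).sum(axis=-1, keepdims=True)
            chosen = w[:, 0] > 0
            if chosen.any():
                out[chosen] += w[chosen] * self._expert(tokens[chosen], weights)
        for weights in self.shared:
            # Shared experts process every token at every scale.
            out += self._expert(tokens, weights)
        return out, top_idx


# The same weights handle a long sequence and a 4x average-pooled one.
mome = MoMESketch()
full = np.random.default_rng(1).normal(size=(32, 16))
compressed = full.reshape(8, 4, 16).mean(axis=1)
out_full, idx_full = mome(full)
out_comp, idx_comp = mome(compressed)
```

Because the router and experts are independent of sequence length, a single set of weights serves both inputs; only the number of tokens routed changes with the compression rate.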

Paper Structure

This paper contains 30 sections, 3 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Comparison of MoME with SOTA methods in terms of WER, number of activated parameters, and training data hours on the LRS3 dataset. MoME matches or outperforms recent AVSR models while training on fewer hours of data, activating fewer parameters, and accommodating users' resource constraints with a single set of model weights.
  • Figure 2: Overview of our proposed MoME module. We start by producing audio-visual tokens at different scales via modality-specific pre-trained encoders and projectors. Each Matryoshka sequence goes through MoME, which can be placed parallel to multiple modules within each LLM layer (parallel to the MHSA module in the Figure). Each MoME module comprises a top-k router, which sparsely activates a subset of routed experts, and a pool of shared experts to capture scale-invariant knowledge.
  • Figure 3: MoME-23/4 results for VSR and ASR tasks on the LRS3 dataset.
  • Figure 4: Intra-modality and cross-modality correlation matrices for MoME-15/3 trained on LRS2. We study the sentence: "it's a long way from home".
  • Figure 5: MoME-15/3-MHSA expert activation analysis across multiple scales and layers on LRS2.
  • ...and 4 more figures
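
The first step in the Figure 2 caption, producing audio-visual tokens at several scales, can be sketched as follows. This is a minimal illustration under stated assumptions: average pooling is used as the compression operator, the helper names `avg_pool_tokens` and `matryoshka_sequences` are hypothetical, and the compression rates and tensor shapes are illustrative values rather than the paper's configuration.

```python
import numpy as np


def avg_pool_tokens(tokens, rate):
    """Average every `rate` consecutive tokens of a (n_tokens, dim) array."""
    n, d = tokens.shape
    pad = (-n) % rate
    if pad:
        # Assumption: repeat the last token so the final window is full.
        tail = np.repeat(tokens[-1:], pad, axis=0)
        tokens = np.concatenate([tokens, tail], axis=0)
    return tokens.reshape(-1, rate, d).mean(axis=1)


def matryoshka_sequences(audio, video, audio_rates, video_rates):
    """Build one concatenated audio-visual sequence per rate pair."""
    sequences = {}
    for ra in audio_rates:
        for rv in video_rates:
            sequences[(ra, rv)] = np.concatenate(
                [avg_pool_tokens(audio, ra), avg_pool_tokens(video, rv)],
                axis=0,
            )
    return sequences


# Illustrative inputs: 50 audio tokens and 25 video tokens, both of width 8.
rng = np.random.default_rng(0)
audio = rng.normal(size=(50, 8))
video = rng.normal(size=(25, 8))
seqs = matryoshka_sequences(audio, video, audio_rates=(4, 16), video_rates=(2, 5))
lengths = {rates: seq.shape[0] for rates, seq in seqs.items()}
```

Each entry of `seqs` is a sequence at a different granularity; higher rates give shorter sequences and therefore lower inference cost when passed to the LLM.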