MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment

Huangbiao Xu, Huanqi Wu, Xiao Ke, Junyi Wu, Rui Xu, Jinglin Xu

TL;DR

This work tackles incomplete multimodal Action Quality Assessment by introducing MCMoE, a single-stage framework that unifies Missing Modality Completion (MMC) with a Mixture of Experts (MoE). An Adaptive Gated Modality Generator reconstructs missing modalities from available ones, while unimodal experts and a soft router dynamically fuse modality-specific and cross-modal information; a Shared Temporal Enhancement Module mitigates semantic gaps. The model uses a grade-based regression with multiple losses to align modalities and encourage diverse grade patterns, achieving state-of-the-art results on three benchmarks in both complete and incomplete settings. By reducing reliance on heavy generative models and enabling robust cross-modal fusion, MCMoE offers a practical solution for real-world AQA where sensor failures or privacy constraints cause modality absence.

Abstract

Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. In practice, however, some modalities are frequently unavailable at inference time. The absence of any modality often renders existing multimodal models inoperable. Even when a model can still run, interrupted cross-modal interactions trigger catastrophic performance degradation. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. Through this mixture of experts, the reconstructed modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at https://github.com/XuHuangbiao/MCMoE.
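The two mechanisms named in the abstract, gated completion of a missing modality and soft mixing of expert outputs, can be illustrated with a toy sketch. This is not the authors' implementation: in the paper both the generator and the router are learned networks operating on temporal features, whereas here the gate and router logits are plain numbers supplied by hand, the experts are identity functions, and all function and variable names (`gated_completion`, `soft_route`, `gate_logits`) are hypothetical. Only the RGB, Flow, and Audio modality names and the zero-vector initialization of missing inputs come from the source.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_completion(features, available, gate_logits):
    """Fill each missing modality with a gate-weighted mix of available ones.

    features:    dict name -> feature vector (missing ones are zero vectors)
    available:   set of modality names that are actually present
    gate_logits: dict (missing_name, source_name) -> float; stands in for a
                 learned gate (hypothetical, defaults to 0.0 when absent)
    """
    if not available:
        raise ValueError("at least one modality must be available")
    sources = sorted(available)
    completed = dict(features)
    for name, vector in features.items():
        if name in available:
            continue
        weights = softmax([gate_logits.get((name, s), 0.0) for s in sources])
        completed[name] = [
            sum(w * features[s][i] for w, s in zip(weights, sources))
            for i in range(len(vector))
        ]
    return completed


def soft_route(expert_outputs, router_logits):
    """Mix all expert outputs with softmax weights (no hard top-k selection)."""
    weights = softmax(router_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]


# Audio is missing at inference, so it starts as a zero vector.
features = {"rgb": [1.0, 2.0], "flow": [3.0, 4.0], "audio": [0.0, 0.0]}
completed = gated_completion(features, {"rgb", "flow"}, gate_logits={})

# Identity "experts" for illustration; a uniform router averages them.
order = ["rgb", "flow", "audio"]
joint = soft_route([completed[m] for m in order], router_logits=[0.0, 0.0, 0.0])
```

With equal gate logits the reconstructed audio vector is simply the average of the RGB and Flow vectors; a learned gate would instead weight the sources per sample. The soft router keeps every expert in play, which is what lets reconstructed modalities be complemented by the others rather than being dropped.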

Paper Structure

This paper contains 22 sections, 19 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: (a) Existing two-stage methods first learn unimodal features from complete multimodal data and then model cross-modal representations to address missing data, leading to higher training cost and complexity. (b) Our MCMoE unifies unimodal and joint representation learning in a single stage by exploiting the complementarity between modality completion and mixture of experts.
  • Figure 2: Overview of our missing completion framework with mixture of experts (MCMoE). Following the SOTA multimodal AQA method PAMFN, we use RGB, Flow, and Audio inputs. All modalities are visible during training, and inputs missing at inference are initialized as zero vectors. Frozen modality-specific extractors extract features, which a shared temporal enhancement module then enhances to bridge cross-modal gaps. Random masking simulates modality incompleteness during training, and an adaptive gated modality generator completes the missing representations. Unimodal experts and a soft router then enable dynamic fusion, followed by cross-modal integration and grade-based regression for score prediction. (Best viewed in color.)
  • Figure 3: Illustration of the proposed Adaptive Gated Modality Generator (AGMG).
  • Figure 4: Comparisons of performance with complete modalities. * indicates our reimplementation based on the official code.
  • Figure 5: t-SNE grade distributions in the three extreme unimodal scenarios, contrasted with the variant without AGMG and MoE.
  • ...and 3 more figures