MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment
Huangbiao Xu, Huanqi Wu, Xiao Ke, Junyi Wu, Rui Xu, Jinglin Xu
TL;DR
This work tackles incomplete multimodal Action Quality Assessment (AQA) by introducing MCMoE, a single-stage framework that unifies Missing Modality Completion (MMC) with a Mixture of Experts (MoE). An Adaptive Gated Modality Generator reconstructs missing modalities from the available ones, unimodal experts and a soft router dynamically fuse modality-specific and cross-modal information, and a Shared Temporal Enhancement Module mitigates semantic gaps. The model uses grade-based regression with multiple losses to align modalities and encourage diverse grade patterns, achieving state-of-the-art results on three benchmarks in both complete and incomplete settings. By reducing reliance on heavy generative models and enabling robust cross-modal fusion, MCMoE offers a practical solution for real-world AQA, where sensor failures or privacy constraints cause modalities to be absent.
Abstract
Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. In practice, however, some modalities are frequently unavailable at inference time. The absence of any modality often renders existing multimodal models inoperable, and it triggers catastrophic performance degradation because cross-modal interactions are interrupted. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. Through this mixture of experts, the reconstructed modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide both modality generation and the extraction of joint representations from generated modalities. Extensive experiments demonstrate that MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at https://github.com/XuHuangbiao/MCMoE.
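To make the two mechanisms named in the abstract more concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes an element-wise sigmoid gate that blends two available modality features into a stand-in for a missing one, and a softmax router that mixes the outputs of per-modality experts. The function names (`gated_fuse`, `mix_experts`), the modality names, the identity experts, and the fixed gate and router logits are all hypothetical; in a trained model these quantities would be learned.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fuse(feat_a, feat_b, gate_logits):
    """Blend two available features into a stand-in for a missing modality.

    Assumption: one gate logit per feature dimension. A gate near 1 copies
    feat_a, a gate near 0 copies feat_b.
    """
    fused = []
    for a, b, z in zip(feat_a, feat_b, gate_logits):
        g = sigmoid(z)
        fused.append(g * a + (1.0 - g) * b)
    return fused


def mix_experts(features, experts, router_logits):
    """Softly mix expert outputs, one expert per modality feature.

    Assumption: the router emits one logit per expert; softmax turns the
    logits into mixing weights that sum to 1.
    """
    weights = softmax(router_logits)
    outputs = [expert(f) for expert, f in zip(experts, features)]
    dim = len(outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(dim)
    ]


# Hypothetical toy features: two modalities are present, a third is missing.
video = [1.0, 2.0]
audio = [3.0, 4.0]

# Reconstruct the missing feature, then fuse all three with a soft router.
flow_hat = gated_fuse(video, audio, gate_logits=[0.0, 0.0])
identity = lambda f: list(f)  # placeholder expert, stands in for a network
joint = mix_experts(
    [video, audio, flow_hat],
    experts=[identity, identity, identity],
    router_logits=[0.0, 0.0, 0.0],
)
```

With zero gate logits the reconstruction is the element-wise mean of the two available features, and with zero router logits every expert receives equal weight. A soft (dense) router is used here, rather than a top-k selection, so that every expert contributes to the joint representation.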
