Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation
Mengshi Qi, Jiaxuan Peng, Xianlin Zhang, Huadong Ma
TL;DR
This work tackles modality imbalance in multi-modal 3D human pose estimation (3D HPE). It introduces a Shapley value-based contribution analysis to quantify each modality's utility during training, paired with an adaptive weight constraint (AWC) regularization, guided by the Fisher Information Matrix, that balances parameter updates. The method adds no extra learnable components and outperforms both naive joint training and existing balancing approaches on the MM-Fi dataset. The contribution analysis shows that RGB and LiDAR provide the strongest signals, while weaker modalities are progressively suppressed during training; the proposed regularization counters this to yield robust, balanced fusion. The approach offers a practical route to reliable multi-modal 3D pose estimation in non-intrusive sensing scenarios.
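To illustrate what a Fisher-guided weight constraint can look like, here is a minimal sketch. The summary does not give the exact form of AWC, so this uses the familiar quadratic penalty weighted by a diagonal Fisher estimate (as in elastic weight consolidation) as a stand-in. The function names, the reference parameters `theta_ref`, and the strength `lam` are all assumptions made for illustration, not the authors' formulation.

```python
# Hypothetical sketch: a Fisher-weighted quadratic penalty on parameter drift.
# This is NOT the paper's AWC definition; it only shows the general idea of
# using Fisher information to decide how strongly each weight is constrained.

def diagonal_fisher(per_sample_grads):
    """Estimate the diagonal of the Fisher Information Matrix as the
    mean of squared per-sample gradients (one list of floats per sample)."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def fisher_penalty(theta, theta_ref, fisher, lam):
    """0.5 * lam * sum_i F_i * (theta_i - theta_ref_i)^2.

    Parameters with large Fisher values are held closer to theta_ref,
    which slows their updates relative to less informative parameters.
    """
    return 0.5 * lam * sum(
        f * (t - r) ** 2 for f, t, r in zip(fisher, theta, theta_ref)
    )


# Toy usage with two parameters and two samples (made-up gradients).
grads = [[1.0, 0.0], [3.0, 2.0]]
fisher = diagonal_fisher(grads)
penalty = fisher_penalty([1.0, 1.0], [0.0, 0.0], fisher, lam=2.0)
```

In a training loop, a penalty of this kind would be added to the task loss for the dominant modality's encoder, so that its high-Fisher parameters move more slowly while weaker modalities catch up.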
Abstract
3D human pose estimation (3D HPE) has emerged as a prominent research topic, particularly in the realm of RGB-based methods. However, the use of RGB images is often limited by issues such as occlusion and privacy constraints. Consequently, multi-modal sensing, which leverages non-intrusive sensors, is gaining increasing attention. Nevertheless, multi-modal 3D HPE still faces challenges, including modality imbalance. In this work, we introduce a novel balanced multi-modal learning method for 3D HPE that combines RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based contribution algorithm to assess the contribution of each modality and detect modality imbalance. To address this imbalance, we design a modality learning regulation strategy that decelerates the learning process during the early stages of training. We conduct extensive experiments on the widely adopted multi-modal dataset MM-Fi, demonstrating the superiority of our approach in enhancing 3D pose estimation under complex conditions. We will release our code soon.
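The Shapley value-based contribution assessment can be sketched as follows. This is a minimal, exact Shapley computation over the four modalities, not the authors' algorithm. The utility function `toy_utility` and the numbers in `TOY_SCORES` are made up for illustration; in practice the utility would be some performance measure of the model evaluated with only a given subset of modalities enabled.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley value of each player.

    `utility` maps a frozenset of players to a float. With four modalities
    this enumerates only 2^3 = 8 subsets per player, so no sampling is needed.
    """
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain from adding modality p to coalition s.
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


# Hypothetical per-modality scores; NOT results from the paper.
MODALITIES = ["rgb", "lidar", "mmwave", "wifi"]
TOY_SCORES = {"rgb": 0.5, "lidar": 0.3, "mmwave": 0.15, "wifi": 0.05}


def toy_utility(subset):
    """Assumed additive utility: the sum of the scores of the active modalities."""
    return sum(TOY_SCORES[m] for m in subset)


phi = shapley_values(MODALITIES, toy_utility)
```

Because the toy utility is additive, each modality's Shapley value equals its own score, and the values sum to the utility of the full set. With a real, non-additive utility, a large gap between the values would signal the kind of modality imbalance the abstract describes.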
