Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation

Mengshi Qi, Jiaxuan Peng, Xianlin Zhang, Huadong Ma

TL;DR

This work tackles modality imbalance in multi-modal 3D HPE by introducing a Shapley value–based contribution analysis that quantifies each modality's utility during training, paired with an adaptive weight constraint (AWC) regularization, guided by the Fisher Information Matrix, that balances parameter updates. The method adds no extra learnable components and outperforms both naive joint training and existing balancing approaches on the MM-Fi dataset. Key findings show that RGB and LiDAR provide the strongest signals and that, under plain end-to-end training, the weaker modalities (mmWave and WiFi) are progressively suppressed; the proposed regularization slows the dominant modalities and protects the weaker ones, yielding a more balanced fusion. The approach offers a practical route to reliable multi-modal 3D pose estimation in non-intrusive sensing scenarios.

Abstract

3D human pose estimation (3D HPE) has emerged as a prominent research topic, particularly for RGB-based methods. However, the use of RGB images is often limited by issues such as occlusion and privacy constraints. Consequently, multi-modal sensing, which leverages non-intrusive sensors, is gaining increasing attention. Nevertheless, multi-modal 3D HPE still faces challenges, including modality imbalance. In this work, we introduce a novel balanced multi-modal learning method for 3D HPE that combines RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based contribution algorithm to assess the contribution of each modality and detect modality imbalance. To address this imbalance, we design a modality learning regulation strategy that decelerates the learning process during the early stages of training. We conduct extensive experiments on the widely adopted multi-modal dataset MM-Fi, demonstrating the superiority of our approach in enhancing 3D pose estimation under complex conditions. We will release our code soon.
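
To make the idea of a Shapley value-based contribution score concrete, here is a minimal sketch that computes exact Shapley values over the four modalities. It illustrates the general Shapley formula only and is not the authors' algorithm: the utility function `toy_value` and the per-modality scores in `TOY_SCORES` are hypothetical stand-ins for whatever validation metric a real model would report on each subset of modalities.

```python
from itertools import combinations
from math import factorial

MODALITIES = ("rgb", "lidar", "mmwave", "wifi")

# Hypothetical per-modality quality scores (assumption, not from the paper).
TOY_SCORES = {"rgb": 0.8, "lidar": 0.6, "mmwave": 0.2, "wifi": 0.1}


def toy_value(subset):
    """Hypothetical utility v(S): the best single score available in S."""
    return max((TOY_SCORES[m] for m in subset), default=0.0)


def shapley_values(players, value_fn):
    """Exact Shapley values.

    phi_i = sum over S not containing i of
            |S|! (n - |S| - 1)! / n! * (v(S + {i}) - v(S))
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition S
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain from adding modality p to coalition S.
                total += weight * (value_fn(s | {p}) - value_fn(s))
        phi[p] = total
    return phi


if __name__ == "__main__":
    for name, score in shapley_values(MODALITIES, toy_value).items():
        print(f"{name}: {score:.4f}")
```

With four modalities there are only 16 subsets, so exact enumeration is cheap. A large gap between the resulting scores is one way such an analysis could flag modality imbalance.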

Paper Structure (15 sections, 15 equations, 9 figures, 5 tables, 1 algorithm)

Figures (9)

  • Figure 1: (a) Illustration of multi-modal 3D HPE. (b) In end-to-end training on MM-Fi [yang2024mm], modality imbalance arises: dominant modalities with higher scores suppress the optimization of the others.
  • Figure 2: Illustration of our proposed method. Modality contributions are assessed via a Shapley module, and an adaptive weight constraint (AWC) loss, weighted by Fisher information, regularizes encoder parameter updates to slow dominant modalities and protect inferior ones during the critical learning window.
  • Figure 3: (a) Mean and (b) standard deviation of human joint coordinate predictions sampled from MM-Fi [yang2024mm] during training. Notably, inferior modalities such as mmWave and WiFi exhibit near-zero standard deviations, indicating limited variability in their predictions.
  • Figure 4: Visualization of contribution scores calculated by our Shapley value-based contribution algorithm using an attention-based fusion strategy.
  • Figure 5: Visual comparisons of 3D human pose estimation between OGM-GE and our method on MM-Fi. Red circles indicate joints where our method achieves superior results.
  • ...and 4 more figures
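
The Figure 2 caption describes a loss weighted by Fisher information that regularizes encoder parameter updates. The exact form of the AWC loss is not given in this excerpt, so the sketch below shows only a generic Fisher-weighted quadratic penalty of the kind used in elastic weight consolidation. The function names, the `strength` parameter, and the choice of a fixed anchor point are all assumptions made for illustration.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher estimate: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def fisher_weighted_penalty(params, anchor, fisher, strength=1.0):
    """Generic penalty 0.5 * strength * sum_i F_i * (theta_i - anchor_i)^2.

    Parameters with large Fisher values are pulled more strongly toward
    the anchor, which slows their updates. This is an assumed form, not
    necessarily the paper's AWC loss.
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for f, p, a in zip(fisher, params, anchor)
    )


if __name__ == "__main__":
    # Hypothetical gradients of two samples w.r.t. two encoder parameters.
    grads = [[1.0, 0.0], [3.0, 2.0]]
    fisher = diagonal_fisher(grads)
    print(fisher_weighted_penalty([1.0, 2.0], [0.0, 0.0], fisher))
```

In a scheme like the one the caption describes, `strength` would presumably be larger for the dominant modality's encoder, for example driven by its contribution score, though this excerpt does not say how the adaptation is done.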