PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation

Mengyuan Liu, Jiajie Liu, Jinyan Zhang, Wenhao Li, Junsong Yuan

TL;DR

PoseMoE tackles a weakness of lifting-based monocular 3D human pose estimation: uncertain depth features erode the well-detected 2D pose features when both are encoded together. It decouples 2D pose and depth features into specialized experts and then uses a cross-expert decoder to re-integrate complementary information. The PoseMoE Encoder disentangles the two feature streams, aided by Gaussian depth priors, while the PoseMoE Decoder employs bidirectional cross-attention to enable conditional, spatio-temporal knowledge transfer. Across Human3.6M, MPI-INF-3DHP, and 3DPW, PoseMoE achieves state-of-the-art accuracy and robustness with a smaller parameter footprint, validated by extensive ablations and qualitative analyses. The work demonstrates that deliberate decoupling followed by strategic aggregation is effective for ill-posed monocular depth inference and points to MoE-based designs as a direction for pose estimation research.

Abstract

Lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. Lifting-based methods encode the detected 2D pose and the unknown depth in an entangled feature space, which explicitly introduces depth uncertainty into the detected 2D pose and thereby limits overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth has first been refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present PoseMoE, a Mixture-of-Experts network for monocular 3D pose estimation. Our approach introduces: (1) A mixture-of-experts network in which specialized expert modules refine the well-detected 2D pose features and learn the depth features. This design disentangles the feature encoding process for 2D pose and depth, thereby reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module that aggregates cross-expert spatio-temporal contextual information, enhancing features through bidirectional mapping between 2D pose and depth. Extensive experiments show that PoseMoE outperforms conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
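
The decouple-then-aggregate idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-frame input, the residual MLP used as a stand-in for each expert, the single-head attention, the random weights, and every name (`expert`, `cross_attention`, `init_params`, `forward`) are assumptions made for illustration. It only shows the data flow: two separately parameterized streams, a bidirectional cross-attention exchange, and two regression heads whose outputs are concatenated into a 3D pose.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def expert(feat, w1, w2):
    # Stand-in expert (assumption): a residual two-layer MLP with its own weights.
    return feat + np.maximum(feat @ w1, 0.0) @ w2

def cross_attention(query_feat, context_feat, w_q, w_k, w_v):
    # Queries come from one stream, keys and values from the other stream.
    q = query_feat @ w_q
    k = context_feat @ w_k
    v = context_feat @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (joints, joints)
    return query_feat + attn @ v                    # residual update

def init_params(rng, channels):
    def mat(rows, cols):
        return rng.normal(size=(rows, cols)) * 0.1
    def triple():
        return tuple(mat(channels, channels) for _ in range(3))
    return {
        "embed": mat(2, channels),
        "pose_expert": (mat(channels, channels), mat(channels, channels)),
        "depth_expert": (mat(channels, channels), mat(channels, channels)),
        "pose_from_depth": triple(),
        "depth_from_pose": triple(),
        "head_xy": mat(channels, 2),
        "head_z": mat(channels, 1),
    }

def forward(pose_2d, depth_init, params):
    # Decoupled encoding: each stream is processed by its own expert.
    pose_feat = expert(pose_2d @ params["embed"], *params["pose_expert"])
    depth_feat = expert(depth_init, *params["depth_expert"])
    # Bidirectional aggregation: each stream attends to the other.
    pose_out = cross_attention(pose_feat, depth_feat, *params["pose_from_depth"])
    depth_out = cross_attention(depth_feat, pose_feat, *params["depth_from_pose"])
    # Separate regression heads, then concatenation into (joints, 3).
    xy = pose_out @ params["head_xy"]
    z = depth_out @ params["head_z"]
    return np.concatenate([xy, z], axis=-1)

rng = np.random.default_rng(0)
num_joints, channels = 17, 32                      # hypothetical sizes
params = init_params(rng, channels)
pose_2d = rng.normal(size=(num_joints, 2))         # placeholder detected 2D pose
depth_init = rng.normal(size=(num_joints, channels))  # Gaussian-initialized depth stream
pose_3d = forward(pose_2d, depth_init, params)
print(pose_3d.shape)  # (17, 3)
```

In this sketch the 2D pose stream never sees depth features until the cross-attention step, which mirrors the abstract's claim that depth should be combined with 2D pose information only after it has been refined.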

Paper Structure

This paper contains 17 sections, 20 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: An illustration of our motivation. We project the 2D pose in the camera coordinate (part of the output 3D pose) back to the image coordinate for comparison. The powerful lifting-based method KTPFormer [peng2024ktpformer] obtains a 2D pose worse than the input, which contradicts our intuition. In contrast, our framework obtains a 2D pose better than the input.
  • Figure 2: Given a 2D pose in the image coordinate, we aim to estimate the 3D pose in the camera coordinate. Left: Conventional lifting-based methods directly project the 2D pose into an entangled feature space and regress the 3D pose from it. Right: Our proposed Mixture-of-Experts network. The 2D pose and depth features are learned separately through expert models. Then, we perform feature interaction to exchange complementary information between the 2D pose feature and the depth feature. Finally, we regress the 2D pose and depth and concatenate them to obtain the final 3D pose.
  • Figure 3: Quantitative comparison of Mean Per Joint Position Error (MPJPE) along different axes for all actions and three hard actions [zeng2021learning] with SOTA lifting-based methods [zheng20243d, peng2024ktpformer]. The MPJPE of the Z-axis (depth) is significantly higher than that of the X-Y axes (2D pose) and accounts for the majority of the overall (3D pose) MPJPE. Our proposed method achieves better results across different axes than the lifting-based framework.
  • Figure 4: Overview of our proposed PoseMoE network. We first project the 2D pose sequence to a high-dimensional feature through a linear embedding. These features are then passed to the PoseMoE Encoder (PME) to generate the refined 2D pose features and the learned depth features. Subsequently, we feed them into the PoseMoE Decoder (PMD) to establish the connection between 2D pose and depth, obtaining the enhanced 2D pose and depth features. Finally, we use two regression heads to regress the 2D pose and depth sequences, respectively, and concatenate them to obtain the 3D pose sequence.
  • Figure 5: Illustration of the proposed 2D pose expert and depth expert. The 2D pose expert takes the 2D pose features and the 2D pose sequence as input and outputs refined 2D pose features. The depth expert takes the depth features along with a supplementary feature initialized from a Gaussian distribution and outputs the learned depth features. The supplementary feature is learnable during training.
  • ...and 9 more figures
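
The per-axis comparison described in the Figure 3 caption can be made concrete with a short NumPy sketch. The page does not give the paper's exact per-axis formula, so the definitions below are assumptions: the overall MPJPE is the mean Euclidean distance per joint, the per-axis error is the mean absolute difference along each axis, and the 2D pose error is the Euclidean distance restricted to X-Y. The function names and the toy data are hypothetical.

```python
import numpy as np

def mpjpe(pred, gt):
    # Mean Euclidean distance per joint; pred and gt have shape (frames, joints, 3).
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def per_axis_error(pred, gt):
    # Assumed per-axis definition: mean absolute error along X, Y, and Z.
    err = np.abs(pred - gt).mean(axis=(0, 1))
    return {axis: float(value) for axis, value in zip("xyz", err)}

def xy_error(pred, gt):
    # Assumed 2D pose error: Euclidean distance restricted to the X-Y plane.
    return float(np.linalg.norm(pred[..., :2] - gt[..., :2], axis=-1).mean())

# Toy data: every joint is off by (3, 4, 12), so depth dominates the total error.
gt = np.zeros((2, 17, 3))
pred = gt + np.array([3.0, 4.0, 12.0])

print(mpjpe(pred, gt))           # 13.0
print(xy_error(pred, gt))        # 5.0
print(per_axis_error(pred, gt))  # {'x': 3.0, 'y': 4.0, 'z': 12.0}
```

In the toy data the Z-axis error alone (12.0) is close to the overall error (13.0), which is the kind of imbalance the Figure 3 caption reports for lifting-based methods.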