PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation
Mengyuan Liu, Jiajie Liu, Jinyan Zhang, Wenhao Li, Junsong Yuan
TL;DR
PoseMoE targets a weakness of lifting-based monocular 3D human pose estimation: encoding the unknown depth jointly with the detected 2D pose lets depth uncertainty degrade the 2D pose features. It decouples 2D pose and depth features into specialized experts, then uses a cross-expert decoder to re-integrate their complementary information. The PoseMoE Encoder disentangles the two feature streams, aided by Gaussian depth priors, while the PoseMoE Decoder employs bidirectional cross-attention for conditional spatio-temporal knowledge transfer between them. On Human3.6M, MPI-INF-3DHP, and 3DPW, PoseMoE is reported to achieve state-of-the-art accuracy and robustness with a smaller parameter footprint, supported by ablations and qualitative analyses. The results indicate that deliberate decoupling followed by targeted aggregation is effective for ill-posed monocular depth inference and suggest a direction for MoE-based pose estimation research.
Abstract
Lifting-based methods have dominated monocular 3D human pose estimation by using detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. Existing lifting-based methods encode the detected 2D pose and the unknown depth in an entangled feature space, which explicitly introduces depth uncertainty into the detected 2D pose and limits overall estimation accuracy. This work shows that the depth representation is pivotal to the estimation process. Specifically, when depth is in its initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental. In contrast, once depth has been refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present PoseMoE, a Mixture-of-Experts network for monocular 3D pose estimation. Our approach introduces: (1) A mixture-of-experts network in which specialized expert modules refine the well-detected 2D pose features and learn the depth features. This design disentangles the feature encoding of 2D pose and depth, thereby reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module that aggregates cross-expert spatio-temporal contextual information, enhancing the features through bidirectional mapping between 2D pose and depth. Extensive experiments show that PoseMoE outperforms conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
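The two ideas in the abstract, separate encoding of 2D pose and depth followed by a bidirectional exchange between the two streams, can be illustrated with a minimal single-frame sketch in plain Python. It is not the authors' implementation. The function names, the feature width `dim`, the random untrained weights, the per-joint placeholder depth tokens, and the residual additions are all assumptions made for illustration, and the temporal dimension and the Gaussian depth priors are omitted. Only the data flow is meaningful.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def random_matrix(rows, cols, rng, scale=0.1):
    """Stand-in for learned weights (hypothetical, untrained)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attention(query_feats, context_feats):
    """Scaled dot-product attention: queries come from one stream,
    keys and values from the other. Projections are omitted for brevity."""
    d = len(query_feats[0])
    scores = matmul(query_feats, transpose(context_feats))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scores)
    return matmul(weights, context_feats), weights


def pose_moe_sketch(pose_2d, dim=8, seed=0):
    """pose_2d: list of [x, y] per joint. Returns ([x, y, z] per joint, attention)."""
    rng = random.Random(seed)
    num_joints = len(pose_2d)

    # Stage 1 (decoupled encoding): each stream has its own parameters.
    # The 2D stream embeds the detected joints; the depth stream starts from
    # placeholder per-joint tokens (an assumption) and never sees the 2D
    # features at this stage, so unknown depth cannot perturb them.
    pose_feats = matmul(pose_2d, random_matrix(2, dim, rng))
    depth_feats = random_matrix(num_joints, dim, rng)

    # Stage 2 (cross-stream aggregation): attention in both directions,
    # each added back to its own stream as a residual.
    pose_ctx, _ = cross_attention(pose_feats, depth_feats)
    depth_ctx, depth_to_pose_attn = cross_attention(depth_feats, pose_feats)
    pose_out = add(pose_feats, pose_ctx)
    depth_out = add(depth_feats, depth_ctx)

    # Separate output heads: (x, y) from the 2D stream, z from the depth stream.
    xy = matmul(pose_out, random_matrix(dim, 2, rng))
    z = matmul(depth_out, random_matrix(dim, 1, rng))
    return [p + q for p, q in zip(xy, z)], depth_to_pose_attn


if __name__ == "__main__":
    toy_pose = [[0.0, 0.0], [0.5, 1.0], [-0.5, 1.0]]  # three hypothetical joints
    pose_3d, attn = pose_moe_sketch(toy_pose)
    print(len(pose_3d), len(pose_3d[0]))
```

Because the weights are random, the returned coordinates carry no meaning. The point of the sketch is the ordering: the streams interact only in the second stage, after each has been encoded on its own.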
