DeProPose: Deficiency-Proof 3D Human Pose Estimation via Adaptive Multi-View Fusion

Jianbin Jiao, Xina Cheng, Kailun Yang, Xiangrong Zhang, Licheng Jiao

TL;DR

DeProPose addresses deficiency-aware 3D pose estimation under occlusion, noise, and missing viewpoints by proposing an end-to-end multi-view framework with an adaptive fusion module based on projection error and absolute error. It introduces a Swin Transformer–based Deficiency-Aware Image Encoder and a fusion adaptor that weights per-view features via $\omega_v = 1/(e_{proj}^v + e_{abs}^v + \epsilon)$ to robustly aggregate information across views. The DA-3DPE dataset provides realistic deficiency scenarios, and experiments on Human3.6M and DA-3DPE demonstrate state-of-the-art robustness with significant MPJPE improvements, validating the effectiveness of adaptive cross-view fusion. The work offers practical benefits for surveillance, motion capture, and VR/AR by enabling accurate 3D pose estimation with simplified training and strong resilience to real-world data deficiencies.
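As a rough illustration of the weighting rule above, the sketch below computes $\omega_v$ for each view and uses the weights to average per-view feature vectors. The source states only the inverse-error form of $\omega_v$. The function names, the default `eps`, and the choice to normalize the weights so they sum to one are assumptions made for this example, not details confirmed by the paper.

```python
from typing import List, Sequence


def view_weights(e_proj: Sequence[float], e_abs: Sequence[float],
                 eps: float = 1e-6) -> List[float]:
    """Inverse-error weights w_v = 1 / (e_proj_v + e_abs_v + eps), one per view."""
    if len(e_proj) != len(e_abs):
        raise ValueError("need one projection error and one absolute error per view")
    return [1.0 / (p + a + eps) for p, a in zip(e_proj, e_abs)]


def fuse_features(features: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[float]:
    """Weighted average of per-view feature vectors.

    Normalizing by the sum of the weights is an assumption of this sketch.
    """
    if not features or len(features) != len(weights):
        raise ValueError("need one non-empty feature vector per weight")
    total = sum(weights)
    dim = len(features[0])
    fused = [0.0] * dim
    for feat, w in zip(features, weights):
        for i in range(dim):
            fused[i] += (w / total) * feat[i]
    return fused


# Hypothetical errors for four views; the third view is heavily degraded.
weights = view_weights([0.1, 0.1, 0.9, 0.1], [0.1, 0.1, 0.9, 0.1])
fused = fuse_features([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], weights)
```

With these made-up errors, the degraded third view receives the smallest weight, so the fused vector stays close to the features of the three reliable views.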

Abstract

3D human pose estimation has wide applications in fields such as intelligent surveillance, motion capture, and virtual reality. However, in real-world scenarios, issues such as occlusion, noise interference, and missing viewpoints can severely affect pose estimation. To address these challenges, we introduce the task of Deficiency-Aware 3D Pose Estimation. Traditional 3D pose estimation methods often rely on multi-stage networks and modular combinations, which can lead to cumulative errors and increased training complexity, making them unable to effectively address deficiency-aware estimation. To this end, we propose DeProPose, a flexible method that simplifies the network architecture to reduce training complexity and avoid information loss in multi-stage designs. Additionally, the model innovatively introduces a multi-view feature fusion mechanism based on relative projection error, which effectively utilizes information from multiple viewpoints and dynamically assigns weights, enabling efficient integration and enhanced robustness to overcome deficiency-aware 3D Pose Estimation challenges. Furthermore, to thoroughly evaluate this end-to-end multi-view 3D human pose estimation model and to advance research on occlusion-related challenges, we have developed a novel 3D human pose estimation dataset, termed the Deficiency-Aware 3D Pose Estimation (DA-3DPE) dataset. This dataset encompasses a wide range of deficiency scenarios, including noise interference, missing viewpoints, and occlusion challenges. Compared to state-of-the-art methods, DeProPose not only excels in addressing the deficiency-aware problem but also shows improvement in conventional scenarios, providing a powerful and user-friendly solution for 3D human pose estimation. The source code will be available at https://github.com/WUJINHUAN/DeProPose.

Paper Structure

This paper contains 31 sections, 15 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustration of the proposed framework for multi-view 3D human pose estimation. Images from multiple views (View1, View2, View3, View4) are processed by the feature and relationship extraction module to extract features. These features are fused in the feature fusion module with adaptive weights ($\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_4$) to compute losses. The fused features are then passed to the head to predict the final 3D pose. This design leverages multi-view information and adaptively balances contributions from different views.
  • Figure 2: Comparison of different fusion methods. (a) Mathematical fusion [amin2013multi, solichah2020marker, kadkhodamohammadi2021generalizable, wan2023view, bartol2022generalizable, jiang2023probabilistic]: Data is processed based on the principles of epipolar geometry, and fusion is achieved by analyzing the geometric relationships between data from different perspectives. (b) Attention fusion [cai2024fusionformer, shuai2022adaptive]: Important data is focused on and selected for fusion using the attention mechanism. (c) Our adaptive weight fusion: Data fusion is achieved by adaptively adjusting the weights of different data sources. It can handle data flexibly according to data characteristics and make full use of the advantages of each data source.
  • Figure 3: This figure illustrates the architecture of the proposed multi-view temporal 3D human pose estimation model. The pipeline begins with the Multi-View Temporal Sequence Data Generator, which processes input video sequences from multiple views into temporal data frames. These frames are then passed through the Feature Extractor to obtain feature representations ($\mathbf{f}$). The extracted features are fed into the Positional Encoder and Temporal Encoder, which encode spatial and temporal information, respectively. A camera-ray-based positional relationship is incorporated to enhance spatial consistency. An Adaptive Weight Adapter dynamically assigns weights ($\mathbf{w}_v$) to features from different views, enabling adaptive fusion of multi-view information. The fused features ($\mathbf{f}_v$) are used to predict the final 3D pose, which is optimized using a multi-term loss function ($L_{\text{total}} = L_1 + L_2 + L_3 + L_4$). This framework effectively combines spatial-temporal relationships and multi-view adaptive fusion to improve pose estimation accuracy and robustness.
  • Figure 4: This figure illustrates examples from the DA-3DPE dataset, a multi-view 3D human pose recognition dataset. It includes 401,017 normal images, 71,823 noisy images with three types of noise (salt-and-pepper, Gaussian, speckle), 56,724 missing images with random occlusions, and 46,125 occluded images where three of four views are obstructed. This dataset provides diverse scenarios to evaluate and improve model robustness under real-world conditions such as noise, missing data, and occlusion.
  • Figure 5: Comparison of inference time vs. MPJPE on subject S11. Our model demonstrates significant efficiency improvements over the state-of-the-art methods, including Jiang et al. [jiang2023probabilistic], Qiu et al. [qiu2019cross], and Shuai et al. [shuai2022adaptive], achieving lower inference time while maintaining competitive accuracy.
  • ...and 1 more figure
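The Figure 4 caption names three noise types (salt-and-pepper, Gaussian, speckle). The sketch below shows one common way to apply each of them to a grayscale image stored as nested lists with values in [0, 1]. It illustrates the noise models only: the parameter values, the function names, and the image representation are assumptions, and the source does not describe how the DA-3DPE images were actually generated.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image, pixel values in [0, 1]


def _clip(x: float) -> float:
    """Clamp a pixel value to the valid range [0, 1]."""
    return min(1.0, max(0.0, x))


def salt_and_pepper(img: Image, amount: float, rng: random.Random) -> Image:
    """Replace roughly a fraction `amount` of pixels with pure white or black."""
    out = []
    for row in img:
        new_row = []
        for px in row:
            if rng.random() < amount:
                new_row.append(1.0 if rng.random() < 0.5 else 0.0)
            else:
                new_row.append(px)
        out.append(new_row)
    return out


def gaussian_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Additive zero-mean Gaussian noise: px + n, with n ~ N(0, sigma^2)."""
    return [[_clip(px + rng.gauss(0.0, sigma)) for px in row] for row in img]


def speckle_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Multiplicative noise: px + px * n, with n ~ N(0, sigma^2)."""
    return [[_clip(px + px * rng.gauss(0.0, sigma)) for px in row] for row in img]


# Hypothetical 4x4 mid-gray image and illustrative noise levels.
rng = random.Random(0)
image = [[0.5] * 4 for _ in range(4)]
noisy_views = {
    "salt_and_pepper": salt_and_pepper(image, 0.1, rng),
    "gaussian": gaussian_noise(image, 0.05, rng),
    "speckle": speckle_noise(image, 0.2, rng),
}
```

Passing a seeded `random.Random` instance keeps the corruption reproducible, which helps when the same degraded views need to be regenerated for a comparison.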