UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues
Vandad Davoodnia, Saeed Ghorbani, Marc-André Carbonneau, Alexandre Messier, Ali Etemad
TL;DR
UPose3D tackles multi-view 3D human pose estimation without direct 3D annotations by introducing cross-view projection and a pose compiler that exploits temporal and cross-view cues together with uncertainty modeling. It combines a 2D keypoint uncertainty estimator with a cross-view point-cloud encoder and a spatiotemporal encoder with criss-cross attention, then recovers 3D poses through maximum-likelihood reconstruction. Training relies on synthetic multi-view sequences generated online from motion-capture data, which enables generalization across diverse actors and viewpoints and scales to varying camera counts. Empirical results show state-of-the-art out-of-distribution (OoD) performance and competitive in-distribution (InD) results, with clear improvements from uncertainty modeling, temporal context, and multi-view fusion, at a computational cost that remains reasonable for practical deployment.
Abstract
We introduce UPose3D, a novel approach for multi-view 3D human pose estimation, addressing challenges in accuracy and scalability. Our method advances existing pose estimation frameworks by improving robustness and flexibility without requiring direct 3D annotations. At the core of our method, a pose compiler module leverages temporal and cross-view information to refine the predictions of a 2D keypoint estimator that operates on single images. Our novel cross-view fusion strategy is scalable to any number of cameras, while our synthetic data generation strategy ensures generalization across diverse actors, scenes, and viewpoints. Finally, UPose3D leverages the prediction uncertainty of both the 2D keypoint estimator and the pose compiler module. This provides robustness to outliers and noisy data, resulting in state-of-the-art performance in out-of-distribution settings. In addition, for in-distribution settings, UPose3D yields performance rivalling methods that rely on 3D annotated data while being the state-of-the-art among methods relying only on 2D supervision.
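To illustrate why per-keypoint uncertainty helps multi-view fusion, the sketch below triangulates a single joint from several calibrated views, weighting each view's equations by the inverse variance of its 2D prediction. It is not the UPose3D reconstruction itself: it uses a simpler weighted linear (algebraic) triangulation as a stand-in for the maximum-likelihood step. The function names, the 3x4 camera-matrix convention, and the scalar per-view `sigma` are assumptions made for this example.

```python
def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P to image coordinates."""
    h = [sum(P[r][c] * X[c] for c in range(3)) + P[r][3] for r in range(3)]
    return (h[0] / h[2], h[1] / h[2])


def solve3(M, y):
    """Solve the 3x3 system M x = y by Gaussian elimination with partial pivoting."""
    A = [list(M[i]) + [y[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        x[i] = (A[i][3] - sum(A[i][j] * x[j] for j in range(i + 1, 3))) / A[i][i]
    return x


def weighted_triangulate(cameras, points_2d, sigmas):
    """Triangulate one joint from several views (hypothetical helper).

    cameras   : list of 3x4 projection matrices (nested lists)
    points_2d : list of (u, v) detections, one per camera
    sigmas    : assumed per-view standard deviation of each 2D detection
    """
    M = [[0.0] * 3 for _ in range(3)]  # accumulates sum of w * a a^T
    y = [0.0] * 3                      # accumulates sum of w * a * b
    for P, (u, v), sigma in zip(cameras, points_2d, sigmas):
        w = 1.0 / (sigma * sigma)      # inverse-variance weight
        for coord, row in ((u, 0), (v, 1)):
            # Linear constraint (coord * P[2] - P[row]) . [X, 1] = 0
            r = [coord * P[2][c] - P[row][c] for c in range(4)]
            a, b = r[:3], -r[3]
            for i in range(3):
                y[i] += w * a[i] * b
                for j in range(3):
                    M[i][j] += w * a[i] * a[j]
    return solve3(M, y)
```

With exact detections the function recovers the true 3D point regardless of the weights. When one view is corrupted, assigning that view a large `sigma` pulls the estimate back toward the consistent views, which is the robustness effect the abstract attributes to uncertainty modeling.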
