UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues

Vandad Davoodnia, Saeed Ghorbani, Marc-André Carbonneau, Alexandre Messier, Ali Etemad

TL;DR

UPose3D addresses multi-view 3D human pose estimation without direct 3D annotations. It projects 2D keypoints across views and refines them with a pose compiler that exploits temporal and cross-view cues while modeling uncertainty. The method combines a 2D keypoint uncertainty estimator with a cross-view point-cloud encoder and a spatiotemporal encoder with criss-cross attention, then recovers the 3D pose by maximum-likelihood reconstruction. Training relies on synthetic multi-view sequences generated online from motion-capture data, which supports generalization across diverse actors and viewpoints and scales to varying camera counts. Experiments show state-of-the-art out-of-distribution (OoD) performance and competitive in-distribution (InD) results, with clear improvements from uncertainty modeling, temporal context, and multi-view fusion, at a computational cost that remains reasonable for practical deployment.

Abstract

We introduce UPose3D, a novel approach for multi-view 3D human pose estimation, addressing challenges in accuracy and scalability. Our method advances existing pose estimation frameworks by improving robustness and flexibility without requiring direct 3D annotations. At the core of our method, a pose compiler module refines predictions from a 2D keypoints estimator that operates on a single image by leveraging temporal and cross-view information. Our novel cross-view fusion strategy is scalable to any number of cameras, while our synthetic data generation strategy ensures generalization across diverse actors, scenes, and viewpoints. Finally, UPose3D leverages the prediction uncertainty of both the 2D keypoint estimator and the pose compiler module. This provides robustness to outliers and noisy data, resulting in state-of-the-art performance in out-of-distribution settings. In addition, for in-distribution settings, UPose3D yields performance rivalling methods that rely on 3D annotated data while being the state-of-the-art among methods relying only on 2D supervision.
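To make the role of prediction uncertainty concrete, the following is a minimal sketch of uncertainty-weighted triangulation for a single joint. It is not the authors' maximum-likelihood formulation, whose details are not given in this section. It swaps in a standard weighted direct linear transform (DLT), in which each view's equations are scaled by the inverse of an assumed per-view standard deviation, so that uncertain detections contribute less. The function name `weighted_triangulate`, the toy cameras, and the sigma values are hypothetical and chosen only for illustration.

```python
import numpy as np


def weighted_triangulate(proj_mats, points_2d, sigmas):
    """Triangulate one joint from V views with a weighted DLT.

    proj_mats: (V, 3, 4) camera projection matrices (assumed known).
    points_2d: (V, 2) detected pixel coordinates of the joint.
    sigmas:    (V,) assumed per-view standard deviations in pixels.
    Returns the estimated 3D point as a (3,) array.
    """
    rows = []
    for P, (u, v), sigma in zip(proj_mats, points_2d, sigmas):
        w = 1.0 / sigma  # down-weight views with high uncertainty
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    # The homogeneous least-squares solution is the right singular
    # vector associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


if __name__ == "__main__":
    # Toy setup (hypothetical): three translated cameras sharing intrinsics K.
    K = np.array([[1000.0, 0.0, 500.0], [0.0, 1000.0, 500.0], [0.0, 0.0, 1.0]])
    offsets = [[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
    Ps = np.stack([K @ np.hstack([np.eye(3), np.reshape(t, (3, 1))]) for t in offsets])
    X_true = np.array([0.2, -0.1, 5.0])
    homog = Ps @ np.append(X_true, 1.0)      # (V, 3) homogeneous projections
    pts = homog[:, :2] / homog[:, 2:3]       # (V, 2) pixel coordinates
    pts[2] += [80.0, -60.0]                  # corrupt the third view
    print("equal weights:   ", weighted_triangulate(Ps, pts, np.ones(3)))
    print("uncertainty-aware:", weighted_triangulate(Ps, pts, np.array([1.0, 1.0, 1e4])))
```

Scaling the DLT rows weights the algebraic error rather than the reprojection error itself, so this is only an approximation of inverse-variance weighting. It is nonetheless enough to show how a corrupted view can be suppressed when its uncertainty is reported as high.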

Paper Structure

This paper contains 30 sections, 1 equation, 12 figures, 7 tables.

Figures (12)

  • Figure 1: We illustrate the key stages of UPose3D. It begins with extracting 2D keypoints and uncertainties from the multi-view videos. The keypoints are then projected onto each reference view using epipolar geometry. Our pose compiler is then used to refine the predictions by leveraging cross-view and spatiotemporal information. Finally, the 3D pose is obtained using the keypoint and uncertainty predictions of each stage.
  • Figure 2: Architecture of the proposed pose compiler module consisting of a point cloud encoder and a spatiotemporal encoder with criss-cross attention. Tensor sizes depend on the batch size $B$, temporal window length $T$, number of joints $J$, camera views $V$ and the point cloud feature dimensionality $H$.
  • Figure 3: We demonstrate how UPose3D scales with the number of cameras.
  • Figure 4: Illustration of UPose3D on the Human3.6m [ionescu2013human3] (right) and RICH [huang2022capturing] (left) datasets, showing the 3D poses estimated by UPose3D (top) compared to the ground truth (bottom).
  • Figure 5: We illustrate our multi-view data synthesis framework, starting with (a) camera placement in a space surrounding a motion-captured human body; (b) extraction and projection of keypoints onto the synthetic cameras; (c) 2D ground-truth keypoints; (d) data corruption; and (e) cross-view projection to prepare the point cloud training data for our pose compiler.
  • ...and 7 more figures
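
The cross-view projection step described in the Figure 1 caption relies on epipolar geometry. As a rough illustration, and not the authors' implementation (whose details are not given in this section), the sketch below uses the textbook construction: it builds a fundamental matrix from two known projection matrices and maps a keypoint in a source view to its epipolar line in a reference view. The function names `fundamental_from_projections` and `epipolar_line` and the toy cameras are hypothetical.

```python
import numpy as np


def skew(v):
    """Return the 3x3 matrix M such that M @ x equals np.cross(v, x)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


def fundamental_from_projections(P_src, P_ref):
    """Fundamental matrix F with x_ref^T F x_src = 0, from 3x4 projections."""
    # The source camera centre is the null vector of P_src.
    _, _, vt = np.linalg.svd(P_src)
    centre = vt[-1]
    epipole = P_ref @ centre  # image of the source centre in the reference view
    return skew(epipole) @ P_ref @ np.linalg.pinv(P_src)


def epipolar_line(F, kp_src):
    """Line (a, b, c) in the reference view, scaled so a^2 + b^2 = 1.

    With this scaling, |a*u + b*v + c| is the pixel distance from (u, v)
    to the line. Assumes the line is not at infinity.
    """
    line = F @ np.array([kp_src[0], kp_src[1], 1.0])
    return line / np.linalg.norm(line[:2])


if __name__ == "__main__":
    # Toy cameras (hypothetical): shared intrinsics, second camera translated.
    K = np.array([[1000.0, 0.0, 500.0], [0.0, 1000.0, 500.0], [0.0, 0.0, 1.0]])
    P_src = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P_ref = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.2]])])
    X = np.array([0.2, -0.1, 5.0, 1.0])
    x_src = (P_src @ X)[:2] / (P_src @ X)[2]
    x_ref = (P_ref @ X)[:2] / (P_ref @ X)[2]
    line = epipolar_line(fundamental_from_projections(P_src, P_ref), x_src)
    print("distance of true match to epipolar line:", abs(line @ np.append(x_ref, 1.0)))
```

A single view constrains the corresponding point in another view only up to a line, which is one reason a method would combine projections from several views, together with their uncertainties, before committing to a refined keypoint.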