Multi-Person 3D Pose Estimation from Multi-View Uncalibrated Depth Cameras

Yu-Jhe Li, Yan Xu, Rawal Khirodkar, Jinhyung Park, Kris Kitani

TL;DR

The paper tackles multi-person 3D pose estimation from a small number of uncalibrated RGBD cameras. It introduces MVD-HPE, a regression-free pipeline that uses 3D Re-ID features from RGBD images to build cross-view correspondences, then applies depth-guided camera pose estimation and depth-constrained triangulation to reconstruct 3D poses. The authors evaluate the approach on RGBD video sets collected at Office, Garage, and Classroom sites and report improvements over prior regression-free methods in both camera pose estimation and 3D pose accuracy, especially with sparse views. The results suggest that uncalibrated depth sensors are practical for multi-person 3D pose estimation in real-world, semi-outdoor-like environments, with possible applications in smart cities and crowd analytics.

Abstract

We tackle the task of multi-view, multi-person 3D human pose estimation from a limited number of uncalibrated depth cameras. Recently, many approaches have been proposed for 3D human pose estimation from multi-view RGB cameras. However, these works (1) assume the number of RGB camera views is large enough for 3D reconstruction, (2) assume the cameras are calibrated, and (3) rely on ground truth 3D poses for training their regression model. In this work, we propose to leverage sparse, uncalibrated depth cameras providing RGBD video streams for 3D human pose estimation. We present a simple pipeline for Multi-View Depth Human Pose Estimation (MVD-HPE) that jointly predicts the camera poses and 3D human poses without training a deep 3D human pose regression model. This framework uses 3D Re-ID appearance features from RGBD images to form more accurate correspondences (for deriving camera positions) than RGB-only features provide. We further propose (1) depth-guided camera-pose estimation, which leverages 3D rigid transformations as guidance, and (2) depth-constrained 3D human pose estimation, which uses depth-projected 3D points as an alternative objective for optimization. To evaluate our proposed pipeline, we collect three sets of RGBD videos recorded from multiple sparse-view depth cameras, with manually annotated ground truth 3D poses. Experiments show that our proposed method outperforms the current 3D human pose regression-free pipelines in terms of both camera pose estimation and 3D human pose estimation.
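
The "depth-projected 3D points" mentioned in the abstract can be illustrated with a standard pinhole back-projection. This is only a minimal sketch, not the authors' implementation: the function name `backproject_keypoints` is hypothetical, and it assumes the per-camera intrinsics (`fx`, `fy`, `cx`, `cy`) are available, which the text here does not state. Keypoints whose depth reading is missing are returned as `None`.

```python
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject_keypoints(
    keypoints_px: Sequence[Tuple[float, float]],
    depths_m: Sequence[Optional[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Optional[Point3D]]:
    """Lift 2D keypoints (u, v) in pixels to camera-frame 3D points.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy). A depth of None or <= 0 is treated as a missing
    measurement and yields None for that keypoint.
    """
    points: List[Optional[Point3D]] = []
    for (u, v), z in zip(keypoints_px, depths_m):
        if z is None or z <= 0.0:
            points.append(None)  # no usable depth at this keypoint
            continue
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points.append((x, y, z))
    return points


# Illustrative values only (hypothetical intrinsics and keypoints).
pts = backproject_keypoints(
    [(320.0, 240.0), (420.0, 240.0), (320.0, 340.0)],
    [2.0, 2.0, 0.0],
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
)
```

In this usage, the first keypoint sits on the principal point and lands on the optical axis at 2 m, while the third has no valid depth and stays `None`, so any later optimization would have to skip it.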

Paper Structure (33 sections, 7 equations, 14 figures, 4 tables)

Figures (14)

  • Figure 1: 3D human pose estimation from multi-view depth cameras. Compared with multi-view RGB cameras, multi-view RGBD cameras provide additional depth information to reconstruct 3D point clouds more precisely.
  • Figure 2: Overview of the proposed pipeline for uncalibrated human pose estimation. The pipeline contains four steps: a) 2D human pose estimation with an off-the-shelf 2D pose detector, b) colored point cloud feature extraction with a 3D Re-ID model, c) depth-guided camera pose estimation, and d) depth-constrained triangulation for 3D human pose estimation.
  • Figure 3: Illustration of 3D appearance feature extraction, which takes an RGB image and a depth image (transformed to point clouds) as input.
  • Figure 4: Illustration of how the rigid transformation (a rotation matrix $R$ and an up-to-scale translation $t$) can be resolved between two sets of 3D points. Not all of the 2D body key points have depth measurements from which 3D points can be projected.
  • Figure 5: Illustration of (a) projected 3D key points from depths and (b) triangulation from the constrained objective.
  • ...and 9 more figures
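
The idea in the Figure 4 caption, resolving a rotation $R$ and translation $t$ between two sets of corresponding 3D points, can be sketched with the standard Kabsch (SVD-based) least-squares alignment. This is a swapped-in textbook technique and not necessarily the authors' procedure; the function name `estimate_rigid_transform` and the `valid` mask are hypothetical. The mask is one simple way to skip key points that lack a depth measurement, and the recovered $t$ is expressed in whatever units the input points use.

```python
import numpy as np


def estimate_rigid_transform(src, dst, valid=None):
    """Least-squares R, t such that dst ~= R @ src + t (Kabsch algorithm).

    src, dst: (N, 3) arrays of corresponding 3D points.
    valid:    optional length-N boolean mask; False entries (for example,
              key points without a depth reading) are ignored.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if valid is not None:
        mask = np.asarray(valid, dtype=bool)
        src, dst = src[mask], dst[mask]
    if len(src) < 3:
        raise ValueError("need at least 3 valid correspondences")

    # Center both point sets on their centroids.
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)

    # 3x3 cross-covariance matrix and its SVD.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


# Illustrative check with a synthetic transform (hypothetical values).
theta = np.pi / 6
R_true = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.5, -1.0, 2.0])
src_pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
dst_pts = src_pts @ R_true.T + t_true
dst_pts[4] = [9.0, 9.0, 9.0]  # corrupted point, excluded via the mask
R_est, t_est = estimate_rigid_transform(
    src_pts, dst_pts, valid=[True, True, True, True, False]
)
```

Masking invalid points before the fit keeps a single bad or missing depth value from skewing the centroids. A real pipeline would likely also need outlier rejection, which this sketch omits.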