Unsupervised Multi-Person 3D Human Pose Estimation From 2D Poses Alone

Peter Hardy, Hansung Kim

TL;DR

This work tackles unsupervised multi-person 2D-to-3D pose estimation from monocular imagery, a setting plagued by perspective ambiguity. It introduces a framework that independently lifts each person’s 2D pose to 3D, then merges them into a shared coordinate system while predicting per-person elevation angles to compensate for camera tilt and enable ground-plane alignment. Key contributions include (i) a novel elevation-angle prediction mechanism for inter-person depth and orientation, (ii) a 3D reconstruction pipeline that remains lightweight enough for real-time use, and (iii) evaluation on the CHI3D dataset with new quantitative metrics to benchmark unsupervised multi-person 2D-3D pose estimation from 2D poses alone. The results establish a baseline for future research and provide a benchmark for unsupervised 3D interaction reconstruction without image data.

Abstract

Current unsupervised 2D-3D human pose estimation (HPE) methods do not work in multi-person scenarios due to perspective ambiguity in monocular images. Therefore, we present one of the first studies investigating the feasibility of unsupervised multi-person 2D-3D HPE from 2D poses alone, focusing on reconstructing human interactions. To address perspective ambiguity, we expand upon prior work by predicting the camera's elevation angle relative to each subject's pelvis. This allows us to rotate the predicted poses to be level with the ground plane, while obtaining an estimate for the vertical offset in 3D between individuals. Our method involves independently lifting each subject's 2D pose to 3D, before combining them in a shared 3D coordinate system. The poses are then rotated and offset by the predicted elevation angle before being scaled. This by itself enables us to retrieve an accurate 3D reconstruction of their poses. We present our results on the CHI3D dataset, introducing its use for unsupervised 2D-3D pose estimation with three new quantitative metrics, and establishing a benchmark for future research.
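
The combine-then-rotate step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`rotate_about_x`, `combine_and_level`), the axis convention (y up, z into the scene, rotation about the camera's x-axis), and the idea that each person's pelvis position in the shared frame is already available are all assumptions made for illustration. The lifting network and the final scaling step are omitted.

```python
import numpy as np

def rotate_about_x(points, angle_rad):
    """Rotate an (N, 3) array of points about the x-axis by angle_rad."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rotation = np.array([[1.0, 0.0, 0.0],
                         [0.0, c, -s],
                         [0.0, s, c]])
    # Points are row vectors, so multiply by the transpose.
    return points @ rotation.T

def combine_and_level(poses, pelvis_positions, elevation_rad):
    """Place root-relative poses in one frame, then undo the camera tilt.

    poses: list of (J, 3) arrays, each with the pelvis at the origin
        (hypothetical output of an independent 2D-to-3D lifting step).
    pelvis_positions: list of (3,) arrays giving each pelvis in the
        shared camera frame (assumed known here).
    elevation_rad: assumed camera elevation angle in radians.
    """
    placed = [pose + pelvis for pose, pelvis in zip(poses, pelvis_positions)]
    return [rotate_about_x(pose, elevation_rad) for pose in placed]

# Two pelvises at the same world height but different depths.
world = np.array([[0.0, 1.0, 3.0],
                  [0.0, 1.0, 6.0]])
tilt = np.deg2rad(20.0)

# Seen from a camera tilted by `tilt`, their heights no longer match.
camera_frame = rotate_about_x(world, -tilt)

# Rotating by the elevation angle makes them level again.
single_joint_poses = [np.zeros((1, 3)), np.zeros((1, 3))]
leveled = combine_and_level(single_joint_poses, list(camera_frame), tilt)
```

In this toy setup the two pelvises have different vertical coordinates in the tilted camera frame, and applying the rotation restores a common height, which is the intuition behind levelling the poses with the ground plane.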

Paper Structure

This paper contains 10 sections, 3 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Errors obtained when current unsupervised 2D-3D lifting approaches are used to lift multiple people to 3D. In the above scenario, the root coordinate is the mid-point between each person's pelvis in 2D. We show a side view of the GT and predicted 3D poses to highlight both pose prediction and 3D distance errors. Note how the person further back in the image appears to be floating and smaller in the predicted 3D when compared to the GT; this is due to the depth ambiguity in a perspective projection setting.
  • Figure 2: Overview of our multi-person pose estimation approach. Given two or more detected 2D poses, our lifting network [hardy] predicts the 3D location of each joint for each pose independently. The 3D poses are then combined in a shared global coordinate system. An elevation compensation approach then predicts the offset of each person's pelvis in 3D. Lastly, each pose is scaled so that their feet are on the same ground plane, which produces our final prediction.
  • Figure 3: The top-right image shows the errors obtained in both scaling and displacement when we assume that the original vertical 2D displacement of the poses (top left) accurately represents the height offset in the real world. The bottom-right image shows our proposed elevation compensation approach to displacement and scaling, which allows a more accurate depth offset and scaling to be predicted.
  • Figure 4: Qualitative results on the CHI3D dataset
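
The depth ambiguity described in the Figure 1 caption can be reproduced with a simple pinhole projection. The sketch below is illustrative only: the focal length, camera height, person height, and the helper names `project` and `standing_person` are assumptions, not values or code from the paper. It shows that a person of the same height standing further from an untilted camera appears both smaller and higher in the image, so treating 2D size and vertical position as 3D size and height yields a smaller, "floating" person.

```python
import numpy as np

def project(points, focal=1.0):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) image points."""
    points = np.asarray(points, dtype=float)
    return focal * points[:, :2] / points[:, 2:3]

def standing_person(depth, height=1.7, camera_height=1.5):
    """Feet and head of an upright person in the camera frame.

    Assumed convention: camera at the origin looking along +z, y pointing up,
    ground plane at y = -camera_height.
    """
    return np.array([[0.0, -camera_height, depth],            # feet
                     [0.0, height - camera_height, depth]])   # head

near = project(standing_person(depth=3.0))
far = project(standing_person(depth=6.0))

# Apparent 2D height of each person (head y minus feet y).
near_height = near[1, 1] - near[0, 1]
far_height = far[1, 1] - far[0, 1]

# Vertical image position of each person's feet.
near_feet_y = near[0, 1]
far_feet_y = far[0, 1]
```

Here the person at twice the depth has half the apparent height, and their feet project closer to the horizon line even though both people stand on the same ground plane.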