SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation

Vinkle Srivastav, Keqi Chen, Nicolas Padoy

TL;DR

SelfPose3d estimates the 3d poses of multiple persons from multiple calibrated camera views without requiring any 2d or 3d ground-truth poses. It treats the 3d poses as a bottleneck representation: they are projected onto every view as 2d joints, rendered as 2d heatmaps in a differentiable manner, and supervised with pseudo 2d poses from an off-the-shelf 2d pose estimator. Training combines cross-affine-view learning, person localization learned from synthetic 3d roots, and an adaptive supervision mechanism that handles noisy pseudo labels. On the Panoptic, Shelf, and Campus benchmarks, the method achieves results comparable to fully-supervised approaches while reducing reliance on 3d ground-truth data and supporting cross-scene generalization. This work offers a practical path toward scalable 3d pose estimation in multi-camera setups by combining learned representations with geometric supervision and self-supervised cues.

Abstract

We present a new self-supervised approach, SelfPose3d, for estimating 3d poses of multiple persons from multiple camera views. Unlike current state-of-the-art fully-supervised methods, our approach does not require any 2d or 3d ground-truth poses and uses only the multi-view input images from a calibrated camera setup and 2d pseudo poses generated from an off-the-shelf 2d human pose estimator. We propose two self-supervised learning objectives: self-supervised person localization in 3d space and self-supervised 3d pose estimation. We achieve self-supervised 3d person localization by training the model on synthetically generated 3d points, serving as 3d person root positions, and on the projected root-heatmaps in all the views. We then model the 3d poses of all the localized persons with a bottleneck representation, map them onto all views obtaining 2d joints, and render them using 2d Gaussian heatmaps in an end-to-end differentiable manner. Afterwards, we use the corresponding 2d joints and heatmaps from the pseudo 2d poses for learning. To alleviate the intrinsic inaccuracy of the pseudo labels, we propose an adaptive supervision attention mechanism to guide the self-supervision. Our experiments and analysis on three public benchmark datasets, including Panoptic, Shelf, and Campus, show the effectiveness of our approach, which is comparable to fully-supervised methods. Code: https://github.com/CAMMA-public/SelfPose3D. Video demo: https://youtu.be/GAqhmUIr2E8.
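
The projection and rendering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple pinhole camera without lens distortion, uses plain Python for the forward computation only, and all names (`project_point`, `render_gaussian_heatmap`, the camera values, `sigma`) are hypothetical. In an automatic-differentiation framework the same operations would be differentiable with respect to the 3d point.

```python
import math

def project_point(point_3d, rotation, translation, focal, center):
    """Project a world-space 3d point to pixel coordinates with a pinhole camera."""
    # World -> camera coordinates: x_cam = R @ x_world + t
    cam = [
        sum(rotation[i][j] * point_3d[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    # Perspective division, then intrinsics (focal length and principal point).
    u = focal[0] * cam[0] / cam[2] + center[0]
    v = focal[1] * cam[1] / cam[2] + center[1]
    return u, v

def render_gaussian_heatmap(joint_2d, width, height, sigma=2.0):
    """Render one 2d joint as a Gaussian heatmap (rows indexed by y, columns by x)."""
    u, v = joint_2d
    return [
        [math.exp(-((x - u) ** 2 + (y - v) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

# Hypothetical calibrated camera: identity rotation, scene origin 4 units ahead.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 4.0]
focal, center = (40.0, 40.0), (16.0, 12.0)

# A synthetic 3d root position (assumed value) and its rendering in this view.
root_3d = [0.5, -0.5, 0.0]
root_2d = project_point(root_3d, R, t, focal, center)
heatmap = render_gaussian_heatmap(root_2d, width=32, height=24)
```

Repeating the projection with each camera's parameters would give one set of 2d joints and heatmaps per view, which is what allows 2d pseudo labels in every view to constrain a single 3d pose.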

Paper Structure (34 sections, 11 equations, 7 figures, 16 tables)

This paper contains 34 sections, 11 equations, 7 figures, 16 tables.

Figures (7)

  • Figure 1: Our self-supervised approach, called SelfPose3d, estimates multi-person 3d poses from multi-view images and pseudo 2d poses generated using an off-the-shelf 2d human pose estimator. We propose a self-supervised learning objective that generates differentiable and geometrically constrained 2d joints and heatmaps across multiple views from bottleneck 3d poses. On the right, we show 3d pose outputs from our approach along with estimated body meshes (using SMPL body mesh fitting on 3d poses [loper2015smpl, bogo2016keep]) and the projected 2d poses.
  • Figure 2: Illustration of our self-supervised SelfPose3d approach for multi-view multi-person 3d pose estimation. Instead of using ground-truth 3d poses for learning, we propose self-supervised learning objectives to localize 3d roots (the mid-hip location of each person) and estimate their 3d poses. We utilize a synthetic 3d roots dataset, two different affine transformations on the multi-view input images ($t_{r,s}^1, t_{r,s}^2$, parametrized by rotation $r$ and scale $s$), a differentiable cross-affine-view rendering of 2d joints and heatmaps from the bottleneck 3d poses, and an adaptive supervision attention mechanism to automatically learn the 3d poses in world-space.
  • Figure 3: Comparison of ground-truth 2d poses, generated by projecting the ground-truth 3d poses onto each multi-view image, with our pseudo 2d poses, generated by running the HRNet human pose estimation model [sun2019deep] on the training dataset. Pseudo 2d poses contain localization errors due to occlusion (see the red arrows), whereas ground-truth 2d poses exist even for partially or entirely occluded persons (see the blue dotted arrows).
  • Figure 4: Qualitative results for the 3d pose estimations, 2d projections on the multi-view images, and estimated SMPL body shapes on example images from the Panoptic dataset.
  • Figure 5: Comparing the visualization of the output 3d poses during epoch 1, using $L_2$ heatmap loss and $L_1$ joint loss respectively.
  • ...and 2 more figures
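
The affine transformations $t_{r,s}$ named in the Figure 2 caption and the two loss types named in the Figure 5 caption can be sketched as follows. This is only an illustration under stated assumptions, not the paper's formulation: the transform is assumed to rotate and scale 2d points about a chosen center, the losses are assumed to be plain mean absolute and mean squared errors, and every function name and value here is hypothetical.

```python
import math

def affine_transform(joints_2d, rotation_deg, scale, center):
    """Rotate by `rotation_deg` and scale by `scale` about `center` (assumed form of t_{r,s})."""
    c = math.cos(math.radians(rotation_deg))
    s = math.sin(math.radians(rotation_deg))
    cx, cy = center
    out = []
    for x, y in joints_2d:
        dx, dy = x - cx, y - cy
        out.append((cx + scale * (c * dx - s * dy),
                    cy + scale * (s * dx + c * dy)))
    return out

def l1_joint_loss(pred, target):
    """Mean absolute difference over all joint coordinates."""
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / (2 * len(pred))

def l2_heatmap_loss(pred, target):
    """Mean squared difference over all heatmap pixels."""
    diffs = [(p - q) ** 2
             for prow, qrow in zip(pred, target)
             for p, q in zip(prow, qrow)]
    return sum(diffs) / len(diffs)

# Hypothetical 2d joints, transformed and then mapped back with the inverse parameters.
joints = [(10.0, 5.0), (12.0, 9.0)]
center = (8.0, 8.0)
view_1 = affine_transform(joints, rotation_deg=90.0, scale=2.0, center=center)
restored = affine_transform(view_1, rotation_deg=-90.0, scale=0.5, center=center)

round_trip_error = l1_joint_loss(restored, joints)
```

Because the transform is invertible, joints rendered under one affine view can be compared with pseudo labels expressed under another once both are mapped into a common frame; the round trip above checks that property numerically.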