Multi-person 3D pose estimation from unlabelled data

Daniel Rodriguez-Criado, Pilar Bachiller, George Vogiatzis, Luis J. Manso

TL;DR

This work tackles multi-person 3D pose estimation from multi-view RGB footage without requiring 3D ground-truth annotations. It introduces a self-supervised two-stage architecture: a Graph Neural Network that matches skeletons across views, and an MLP that fuses the matched multi-view 2D data to predict 3D keypoints and is trained with a projection-based loss. The approach achieves near-perfect skeleton matching across varying numbers of views and 3D pose accuracy competitive with supervised baselines, while running in real time and adapting to camera subsets, including deployment on a mobile robot. The findings demonstrate the practicality of unlabelled, RGB-only, multi-camera systems for robust 3D human pose estimation, with potential for scalable, scenario-agnostic deployment.

Abstract

Its numerous applications make multi-human 3D pose estimation a remarkably impactful area of research. Nevertheless, assuming a multiple-view system composed of several regular RGB cameras, 3D multi-pose estimation presents several challenges. First of all, each person must be uniquely identified in the different views to separate the 2D information provided by the cameras. Secondly, the 3D pose estimation process from the multi-view 2D information of each person must be robust against noise and potential occlusions in the scenario. In this work, we address these two challenges with the help of deep learning. Specifically, we present a model based on Graph Neural Networks capable of predicting the cross-view correspondence of the people in the scenario along with a Multilayer Perceptron that takes the 2D points to yield the 3D poses of each person. These two models are trained in a self-supervised manner, thus avoiding the need for large datasets with 3D annotations.
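The TL;DR mentions a projection-based loss as the training signal that replaces 3D annotations. The following is a minimal sketch of that general idea, not the authors' formulation: it assumes a pinhole camera described by a 3x4 projection matrix and measures the mean 2D distance between the projected 3D keypoints and the detected 2D keypoints. The names `project` and `reprojection_loss`, the data layout, and the use of `None` for undetected keypoints are all hypothetical.

```python
from math import hypot


def project(P, X):
    """Project a 3D point X = (x, y, z) with a 3x4 camera matrix P (pinhole model)."""
    xh = (X[0], X[1], X[2], 1.0)  # homogeneous coordinates
    u, v, w = (sum(p * x for p, x in zip(row, xh)) for row in P)
    return (u / w, v / w)


def reprojection_loss(pred_3d, detections, cameras):
    """Mean 2D distance between projected 3D keypoints and detected 2D keypoints.

    pred_3d:    list of (x, y, z) keypoints predicted for one person
    detections: {camera_id: list of (u, v), or None for an undetected keypoint}
    cameras:    {camera_id: 3x4 projection matrix as nested lists}
    """
    total, count = 0.0, 0
    for cam_id, keypoints_2d in detections.items():
        for X, kp in zip(pred_3d, keypoints_2d):
            if kp is None:
                continue  # skip keypoints this camera did not detect
            u, v = project(cameras[cam_id], X)
            total += hypot(u - kp[0], v - kp[1])
            count += 1
    return total / count if count else 0.0
```

Because the error is measured in the image plane of calibrated cameras, a loss of this kind needs only 2D detections and camera parameters, which is what makes training without 3D labels possible.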

Paper Structure

This paper contains 14 sections, 6 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: The last two stages of the pipeline of the proposed system. The correspondences between the input skeletons in the different views are estimated by the GNN. This information is leveraged by the MLP to provide the final 3D poses.
  • Figure 2: Generation of a sample of the dataset. Graphs of individual persons are generated first, assigning a score of $1$ to the match nodes connecting the views (green nodes). A final graph is then built from the individual ones by adding match nodes with a score of $0$ (red nodes).
  • Figure 3: Representation of how the 3D pose estimation network training loss is computed in a setup with $4$ cameras in $\mathbb{C}_t$ and $3$ cameras in $\mathbb{C}_i$.
  • Figure 4: Pose estimation results for 2 samples of the test sequences using our model (left images) and triangulation (right images). The ground truth is shown in gray. Triangulation provides complete poses in the 2 samples.
  • Figure 5: Pose estimation results for 2 samples of the test sequences using our model (left images) and triangulation (right images). The ground truth is shown in gray. In these samples, triangulation cannot provide complete poses due to an insufficient number of views for some keypoints.
  • ...and 3 more figures
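
The labelling scheme described in the Figure 2 caption (score $1$ for match nodes linking views of the same person, score $0$ for match nodes linking different persons) can be sketched as follows. This is an illustrative assumption about the data layout, not the authors' dataset code: each 2D skeleton is reduced to a hypothetical `(person_id, view_id)` pair, and it is assumed that match nodes are only created between skeletons seen from different views.

```python
from itertools import combinations


def match_labels(skeletons):
    """Assign a target score to every cross-view pair of skeletons.

    skeletons: list of (person_id, view_id) tuples, one per 2D skeleton.
    Returns {(i, j): score} with score 1 when both skeletons belong to the
    same person and 0 otherwise. Same-view pairs are skipped (an assumption).
    """
    labels = {}
    for (i, (pi, vi)), (j, (pj, vj)) in combinations(enumerate(skeletons), 2):
        if vi == vj:
            continue  # two skeletons in the same view cannot be the same person
        labels[(i, j)] = 1 if pi == pj else 0
    return labels


# Two persons ("A", "B"), each seen from views 0 and 1.
sample = match_labels([("A", 0), ("A", 1), ("B", 0), ("B", 1)])
```

For this sample, the pairs `(0, 1)` and `(2, 3)` receive a score of 1, while `(0, 3)` and `(1, 2)` receive a score of 0.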