Unsupervised 3D Keypoint Discovery with Multi-View Geometry

Sina Honari, Chen Zhao, Mathieu Salzmann, Pascal Fua

TL;DR

This work introduces a fully unsupervised framework for discovering 3D human keypoints from calibrated multi-view images by leveraging foreground mask reconstruction and multi-view geometry. The pipeline first extracts view-specific features to infer 2D keypoints, triangulates them into 3D keypoints, and then reprojects these into each view to reconstruct the foreground mask, all without any joint or mask annotations. A subsequent single-view lifting model maps 2D detections to 3D keypoints, which are then mapped to a target pose using a lightweight regressor. On Human3.6M and MPI-INF-3DHP, the proposed method achieves state-of-the-art performance among unsupervised approaches, demonstrates strong cross-dataset generalization, and benefits from a set of losses that jointly enforce image reconstruction, foreground consistency, and geometric feasibility. The approach is practical for uncurated real-world data and lays groundwork for temporal and multi-foreground extensions.

Abstract

Analyzing and training 3D body posture models depend heavily on the availability of joint labels that are commonly acquired through laborious manual annotation of body joints or via marker-based joint localization using carefully curated markers and capturing systems. However, such annotations are not always available, especially for people performing unusual activities. In this paper, we propose an algorithm that learns to discover 3D keypoints on human bodies from multiple-view images without any supervision or labels other than the constraints multiple-view geometry provides. To ensure that the discovered 3D keypoints are meaningful, they are re-projected to each view to estimate the person's mask that the model itself has initially estimated without supervision. Our approach discovers more interpretable and accurate 3D keypoints compared to other state-of-the-art unsupervised approaches on Human3.6M and MPI-INF-3DHP benchmark datasets.
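
The triangulate-then-reproject step described in the abstract can be illustrated with a minimal NumPy sketch. The source does not state which triangulation solver is used, so the Direct Linear Transform (DLT) below is an assumption, as are the helper names `project` and `triangulate_dlt` and the toy two-camera setup.

```python
import numpy as np

def project(P, X):
    """Project 3D world points X of shape (K, 3) with a 3x4 camera matrix P to pixels (K, 2)."""
    X_h = np.hstack([X, np.ones((X.shape[0], 1))])  # homogeneous coordinates, (K, 4)
    x_h = X_h @ P.T                                  # homogeneous image points, (K, 3)
    return x_h[:, :2] / x_h[:, 2:3]

def triangulate_dlt(Ps, xs):
    """Triangulate K keypoints seen in V views (assumed solver: DLT).

    Ps: sequence of V projection matrices, each (3, 4).
    xs: array of 2D keypoints with shape (V, K, 2).
    Returns an array of 3D points with shape (K, 3).
    """
    num_kpts = xs.shape[1]
    X = np.zeros((num_kpts, 3))
    for k in range(num_kpts):
        rows = []
        for P, x in zip(Ps, xs[:, k]):
            # Each view contributes two linear constraints on the homogeneous 3D point.
            rows.append(x[0] * P[2] - P[0])
            rows.append(x[1] * P[2] - P[1])
        A = np.stack(rows)
        # The solution is the right singular vector with the smallest singular value.
        _, _, Vt = np.linalg.svd(A)
        X_h = Vt[-1]
        X[k] = X_h[:3] / X_h[3]
    return X

# Toy setup (hypothetical values): two cameras sharing intrinsics, one shifted along x.
K_intr = np.array([[1000.0, 0.0, 500.0],
                   [0.0, 1000.0, 500.0],
                   [0.0, 0.0, 1.0]])
Rt1 = np.hstack([np.eye(3), np.zeros((3, 1))])
Rt2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
Ps = [K_intr @ Rt1, K_intr @ Rt2]

X_true = np.array([[0.2, -0.1, 4.0],
                   [-0.3, 0.4, 5.0]])
xs = np.stack([project(P, X_true) for P in Ps])  # per-view 2D keypoints, (2, 2, 2)

X_rec = triangulate_dlt(Ps, xs)                          # 3D keypoints in world coordinates
xs_reproj = np.stack([project(P, X_rec) for P in Ps])    # view-specific reprojections
```

With noise-free inputs the reprojections coincide with the detections. With noisy detections, the gap between them is what a geometric consistency term would penalize.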

Paper Structure (28 sections, 17 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Multi-View Geometry for Unsupervised 3D Keypoint Discovery. Our approach finds unsupervised 2D keypoints in each view and then uses multi-view geometry to construct 3D keypoints. These discovered keypoints (observed above), whose location is learned without any supervision, can be later mapped to the final pose of interest (e.g. the joint locations).
  • Figure 2: Approach. Given images from different views and an estimate of the background image, the model first detects and crops the subject in each view. The cropped patch is then passed to an encoder that encodes only the foreground subject information $\Xi$ by reconstructing the input image through prediction of the foreground mask $\mathbf{M}_{\mathbf{p}}$. This constitutes the image-reconstruction path. The encoded features $\Xi$ are then used to discover 2D keypoints $\mathbf{x}$ by applying a 2D soft-argmax to each keypoint channel. The 2D keypoints from different views are then triangulated using the full camera projection matrices to obtain 3D keypoints $\mathbf{X}$ in world coordinates, which are then projected separately to each view to obtain the view-specific 2D keypoints $\hat{\mathbf{x}}$. These 2D keypoints are then used to construct a mask $\tilde{\mathbf{M}}_{\mathbf{p}}$ by minimizing its difference to the mask $\mathbf{M}_{\mathbf{p}}$ predicted by the model itself in the image-reconstruction path. No labels are used for subject detection, mask reconstruction, or keypoint estimation. The dashed lines indicate trainable model components.
  • Figure 3: 2D and 3D keypoints found by a 32-keypoint prediction model on H36M. The 2D keypoints are consistent across views and the 3D keypoints capture the posture of the person, which indicates that they correlate with the person's pose.
  • Figure 4: 2D and 3D keypoints found by 16- (left) and 32- (right) keypoint prediction models on 3DHP. As in Fig. 3, the 2D keypoints are consistent across views and the 3D keypoints capture the posture of the person.
  • Figure 5: Comparison against the latent features of the pre-trained ImageNet model on H36M. The three plots, from left to right, show MPJPE, NMPJPE, and PMPJPE (in mm) for different percentages of labeled 3D data. Both models use a ResNet50 encoder [He16a] and a 2-hidden-layer MLP to regress either the 3D keypoints (Ours) or the latent features (the features before the classification layer in the ImageNet model) to the target 3D pose.
  • ...and 3 more figures
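
The 2D soft-argmax mentioned in the Figure 2 caption can be sketched in NumPy as follows. This is an illustrative version under stated assumptions, not the authors' implementation: the function name `soft_argmax_2d`, the `temperature` parameter, and the toy heatmap are hypothetical. The idea is to apply a softmax over each keypoint channel and take the expected pixel coordinate, which keeps the keypoint location differentiable with respect to the feature map.

```python
import numpy as np

def soft_argmax_2d(heatmaps, temperature=1.0):
    """Differentiable keypoint localization from per-keypoint score maps.

    heatmaps: array of shape (K, H, W), one channel per keypoint.
    temperature: hypothetical sharpness parameter (not from the source).
    Returns an array of shape (K, 2) holding (x, y) pixel coordinates.
    """
    num_kpts, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_kpts, -1) / temperature
    flat = flat - flat.max(axis=1, keepdims=True)   # subtract max for numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)       # softmax over all pixels of a channel
    probs = probs.reshape(num_kpts, height, width)

    xs = np.arange(width)
    ys = np.arange(height)
    # Expected coordinate under the spatial distribution of each channel.
    x = (probs.sum(axis=1) * xs).sum(axis=1)        # marginal over rows, shape (K, W)
    y = (probs.sum(axis=2) * ys).sum(axis=1)        # marginal over columns, shape (K, H)
    return np.stack([x, y], axis=1)

# Toy heatmap with a single sharp peak at row 3, column 5.
hm = np.zeros((1, 8, 10))
hm[0, 3, 5] = 50.0
kp = soft_argmax_2d(hm)  # expected to lie very close to (x=5, y=3)
```

A flatter heatmap pulls the estimate toward the spatial mean of the channel, which is why soft-argmax outputs are sub-pixel and depend on how peaked the scores are.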