Unsupervised 3D Keypoint Discovery with Multi-View Geometry
Sina Honari, Chen Zhao, Mathieu Salzmann, Pascal Fua
TL;DR
This work introduces a fully unsupervised framework for discovering 3D human keypoints from calibrated multi-view images by leveraging foreground mask reconstruction and multi-view geometry. The pipeline first extracts view-specific features to infer 2D keypoints, triangulates them into 3D keypoints, and then reprojects these into each view to refine a foreground mask, all without any joint or mask annotations. A subsequent single-view lifting model maps 2D detections to 3D keypoints, which are then mapped to a target pose using a lightweight regressor, enabling robust pose estimation without supervision. On Human3.6M and MPI-INF-3DHP, the proposed method achieves state-of-the-art performance among unsupervised approaches, demonstrates strong cross-dataset generalization, and benefits from a carefully designed set of losses that jointly enforce reconstruction, foreground consistency, and geometric feasibility. The approach is practical for uncurated real-world data and lays groundwork for further temporal and multi-foreground extensions.
Abstract
Analyzing and training 3D body posture models depend heavily on the availability of joint labels, which are commonly acquired through laborious manual annotation of body joints or via marker-based joint localization using carefully curated markers and capture systems. However, such annotations are not always available, especially for people performing unusual activities. In this paper, we propose an algorithm that learns to discover 3D keypoints on human bodies from multiple-view images without any supervision or labels other than the constraints that multiple-view geometry provides. To ensure that the discovered 3D keypoints are meaningful, they are re-projected to each view to reconstruct the person's mask, which the model itself has initially estimated without supervision. Our approach discovers more interpretable and accurate 3D keypoints than other state-of-the-art unsupervised approaches on the Human3.6M and MPI-INF-3DHP benchmark datasets.
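The geometric core described above, triangulating per-view 2D keypoints into a 3D keypoint and re-projecting it into each calibrated view, can be illustrated with a minimal NumPy sketch. The text does not specify the triangulation method, so the unweighted linear DLT used here is an assumption, as are the function names `triangulate_dlt` and `reproject` and the two toy cameras. It is meant only to show the multi-view constraint, not the authors' implementation.

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Triangulate one 3D point from two or more calibrated views.

    Assumption: plain (unweighted) linear DLT; the source does not say
    which triangulation variant is used.
    projections: (V, 3, 4) camera projection matrices.
    points_2d:   (V, 2) pixel coordinates of the same keypoint per view.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]
    return X_h[:3] / X_h[3]

def reproject(projections, point_3d):
    """Project a 3D point into every view; returns (V, 2) pixel coordinates."""
    X_h = np.append(point_3d, 1.0)
    x = projections @ X_h            # (V, 3) homogeneous image points
    return x[:, :2] / x[:, 2:3]

# Hypothetical two-camera rig: shared intrinsics, second camera shifted along x.
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
projections = np.stack([P1, P2])

X_true = np.array([0.2, -0.1, 4.0])      # toy ground-truth 3D keypoint
uv = reproject(projections, X_true)       # stands in for detected 2D keypoints
X_hat = triangulate_dlt(projections, uv)  # recovered 3D keypoint
reproj_error = np.linalg.norm(reproject(projections, X_hat) - uv, axis=1)
```

With noise-free 2D inputs, `X_hat` should match `X_true` up to numerical precision and `reproj_error` should be close to zero in both views. In a learned pipeline, the 2D inputs would come from a network rather than from a known 3D point.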
