Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation

Yuchen Yang; Yu Qiao; Xiao Sun

TL;DR

This paper tackles unsupervised 3D human pose estimation from monocular RGB by introducing Mask as Supervision, which uses mask information as a supervisory signal to guide 3D keypoint localization. It defines two complementary mask priors, the Skeleton Mask (structure) and the Physique Mask (shape), and employs geodesic weighting and cascaded optimization to recover interpretable 3D joints without pose annotations. The approach supports multiple data modalities (video and multi-view) and leverages in-the-wild data to boost generalization, achieving state-of-the-art performance on Human3.6M and MPI-INF-3DHP in fully unsupervised settings. The work advances practical unsupervised 3D pose estimation by enabling the use of annotation-free data while producing physically interpretable joints, and it scales by drawing on diverse data sources.

Abstract

Automatic estimation of 3D human pose from monocular RGB images is a challenging and unsolved problem in computer vision. Supervised approaches rely heavily on laborious annotations and generalize poorly because of the limited diversity of 3D pose datasets. To address these challenges, we propose a unified framework that leverages mask as supervision for unsupervised 3D pose estimation. Using masks from general unsupervised segmentation algorithms, the proposed model employs skeleton and physique representations that extract pose information from coarse to fine. Unlike previous unsupervised approaches, we organize the human skeleton in a fully unsupervised way, which enables the processing of annotation-free data and provides ready-to-use estimation results. Comprehensive experiments demonstrate state-of-the-art pose estimation performance on the Human3.6M and MPI-INF-3DHP datasets. Further experiments on in-the-wild datasets show that the model benefits from access to more data. Code will be available at https://github.com/Charrrrrlie/Mask-as-Supervision.
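To make the idea of supervising keypoints through a mask more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a skeleton-style mask can be rendered from projected 2D joints as a soft map that decays with distance to the nearest bone segment, and that the rendered mask is compared to a target foreground mask with a mean squared error. The function names, the Gaussian falloff, the `sigma` value, and the toy joint layout are all hypothetical choices made for illustration; the geodesic weighting and the Physique Mask are not modeled here.

```python
import math


def point_to_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point (px, py) to the segment from a to b."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate bone: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def render_skeleton_mask(joints_2d, bones, height, width, sigma=1.5):
    """Render a soft mask in [0, 1] that peaks on the bone segments.

    joints_2d: list of (x, y) pixel coordinates (hypothetical layout).
    bones: list of (i, j) index pairs connecting joints (assumed prior).
    """
    mask = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d = min(
                point_to_segment_distance(x, y, *joints_2d[i], *joints_2d[j])
                for i, j in bones
            )
            # Gaussian falloff is an illustrative assumption.
            mask[y][x] = math.exp(-(d * d) / (2.0 * sigma * sigma))
    return mask


def mask_reconstruction_loss(pred_mask, target_mask):
    """Mean squared error between two masks of the same shape."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred_mask, target_mask):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy example: a 4-joint "stick figure" on a 9x9 grid.
bones = [(0, 1), (1, 2), (1, 3)]
true_joints = [(4.0, 1.0), (4.0, 4.0), (2.0, 7.0), (6.0, 7.0)]
shifted_joints = [(x + 2.0, y) for x, y in true_joints]

target = render_skeleton_mask(true_joints, bones, 9, 9)
loss_good = mask_reconstruction_loss(
    render_skeleton_mask(true_joints, bones, 9, 9), target
)
loss_bad = mask_reconstruction_loss(
    render_skeleton_mask(shifted_joints, bones, 9, 9), target
)
print(loss_good, loss_bad)
```

In this toy setting the loss is lower when the joints line up with the target mask than when they are shifted, which is the property that lets a mask act as a training signal for keypoint locations. A real training setup would implement the rendering in a differentiable framework so that gradients reach the pose detector.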

Paper Structure (33 sections, 11 equations, 7 figures, 3 tables)

Introduction
Related Work
Unsupervised 2D Landmark Detection.
Unsupervised 2D to 3D Lifting.
Unsupervised 3D Pose Estimation.
Method
3D Pose Estimation from Single Images
Mask as Supervision
Problem Definition.
Baseline.
Skeleton Mask from Body Structure Prior.
Physique Mask from Body Shape Prior.
Geodesic Weighting for Hard Positives and Negatives.
Leveraging Priors in Diverse Data Modalities
Video Data Modality.
...and 18 more sections

Figures (7)

Figure 1: Schematic of mask as supervision. For unsupervised pose estimation, we aim to remove the dependence on annotations and make use of readily attainable data. The foreground mask is easy to acquire and carries fine-grained information, which motivates our approach.
Figure 2: Different acquisition methods for structural priors. Known as an important prior, structural information leads to plausible skeletons in pose estimation. Without considering it, most previous methods necessitate supervised post-processing (SPP) during inference. To tackle this issue, we propose a method that enables effective and ready-to-use pose estimation in an unsupervised fashion.
Figure 3: Overview. We aim to gain supervision for the 3D detector from mask reconstruction. The Skeleton Mask and Physique Mask representations are proposed for reconstruction at coarse-to-fine granularity. Additionally, Geodesic Weighting is adopted to further leverage mask information. Note that only the detector is used during inference.
Figure 4: Reconstruction loss under different training strategies. The blue and red curves represent training with and without the cascade, respectively. In addition, cascade stages are distinguished by color. Losses are scaled for visualization.
Figure 5: Qualitative results on Human3.6M and MPI-INF-3DHP datasets. We visualize joints in 2D and 3D coordinate systems, for Human3.6M (left) and MPI-INF-3DHP (right) datasets. Every $4^{th}$ column in each sub-figure shows 3D ground truth joints.
...and 2 more figures
