Sapiens: Foundation for Human Vision Models

Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, Shunsuke Saito

TL;DR

Sapiens introduces a foundation-model approach for four core human-centric vision tasks by pretraining large Vision Transformers on a massive, human-focused dataset with MAE at 1024×1024 resolution. Through simple fine-tuning of lightweight task heads, the model delivers state-of-the-art performance across 2D pose, body-part segmentation, monocular depth, and surface normals, with strong generalization to in-the-wild data. Key contributions include the Humans-300M dataset composition, high-fidelity pretraining, and extensive demonstrations of cross-task effectiveness and synthetic-data augmentation. This work argues for domain-specific, large-scale pretraining as a pathway to robust, scalable human-centric vision foundations applicable with minimal per-task engineering.

Abstract

We present Sapiens, a family of models for four fundamental human-centric vision tasks -- 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability -- model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error. Project page: https://about.meta.com/realitylabs/codecavatars/sapiens.

Paper Structure (17 sections, 2 equations, 11 figures, 7 tables)

This paper contains 17 sections, 2 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Sapiens models are fine-tuned for four human tasks: 2D pose estimation, body-part segmentation, depth prediction, and normal prediction. Our models generalize across a variety of in-the-wild face, upper-body, full-body, and multi-person images.
  • Figure 2: Overview of number of humans per image in the Humans-300M dataset.
  • Figure 3: Sapiens reconstruction on unseen images. Top: Each triplet contains the ground truth (left), the masked image (center), and the MAE reconstruction (right), with a masking ratio of $75\%$, a patch size of $16$, and an image size of $1024$. Bottom: Varying the mask ratio between [0.75, 0.95] during inference reveals a minimal reduction in quality, underscoring the model's understanding of human images.
  • Figure 4: Ground-truth annotations for 2D pose estimation and body-part segmentation.
  • Figure 5: Ground-truth synthetic annotations for depth and surface normal estimation.
  • ...and 6 more figures
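
The masking setup quoted in the Figure 3 caption (mask ratio 75%, patch size 16, image size 1024) implies a concrete token budget, which the sketch below works out. It is a minimal illustration that assumes uniform random masking over non-overlapping square patches, as in a standard MAE; the function name `random_patch_mask` and its parameters are hypothetical and are not taken from the Sapiens code.

```python
import random


def random_patch_mask(image_size=1024, patch_size=16, mask_ratio=0.75, seed=0):
    """Split patch indices into (visible, masked) for one MAE-style random mask.

    Hypothetical helper for illustration only; assumes a square image that is
    tiled by non-overlapping square patches.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")

    grid = image_size // patch_size      # patches per side
    num_patches = grid * grid            # total patch tokens
    num_visible = int(num_patches * (1 - mask_ratio))

    # Shuffle all patch indices and keep the first num_visible as visible.
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)

    visible = sorted(order[:num_visible])
    masked = sorted(order[num_visible:])
    return visible, masked


visible, masked = random_patch_mask()
print(len(visible) + len(masked), len(visible), len(masked))  # 4096 1024 3072
```

Under these assumptions, a 1024×1024 image yields a 64×64 grid of 4096 patches, of which only 1024 are visible to the encoder at a 75% mask ratio. Passing `mask_ratio=0.95` leaves 204 visible patches, which is the upper end of the range the caption varies at inference.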