SapiensID: Foundation for Human Recognition

Minchul Kim, Dingqiang Ye, Yiyang Su, Feng Liu, Xiaoming Liu

TL;DR

SapiensID tackles the problem of unifying face and body recognition under wide pose, scale, and visibility variations by introducing Retina Patch (RP), Masked Recognition Model (MRM), and Semantic Attention Head (SAH), trained on the large and diverse WebBody4M dataset. RP enables ROI-aware, region-consistent patch tokenization for Vision Transformers, while MRM accommodates variable token counts through masking with attention scaling and a variable masking rate. SAH provides pose-invariant representations by pooling features around key body parts, aided by predicted keypoints, and the WebBody4M data enables broad generalization across short-term and long-term ReID tasks, including Cross Pose-Scale ReID. The combination yields state-of-the-art results on ReID benchmarks, strong cross-modality capabilities, and a new baseline for holistic human recognition in unconstrained environments, with implications for scalable, privacy-conscious deployments.

Abstract

Existing human recognition systems often rely on separate, specialized models for face and body analysis, limiting their effectiveness in real-world scenarios where pose, visibility, and context vary widely. This paper introduces SapiensID, a unified model that bridges this gap, achieving robust performance across diverse settings. SapiensID introduces (i) Retina Patch (RP), a dynamic patch generation scheme that adapts to subject scale and ensures consistent tokenization of regions of interest, (ii) a Masked Recognition Model (MRM) that learns from variable token lengths, and (iii) Semantic Attention Head (SAH), a module that learns pose-invariant representations by pooling features around key body parts. To facilitate training, we introduce WebBody4M, a large-scale dataset capturing diverse poses and scale variations. Extensive experiments demonstrate that SapiensID achieves state-of-the-art results on various body ReID benchmarks, outperforming specialized models in both short-term and long-term scenarios while remaining competitive with dedicated face recognition systems. Furthermore, SapiensID establishes a strong baseline for the newly introduced challenge of Cross Pose-Scale ReID, demonstrating its ability to generalize to complex, real-world conditions.

Paper Structure

This paper contains 40 sections, 23 equations, 14 figures, 17 tables.

Figures (14)

  • Figure 1: SapiensID is a human recognition model trained on a large-scale dataset of human images featuring varied poses and visible body parts. For the first time, a single model performs effectively across diverse face and body benchmarks (lfw, calfw, shu2021large, yang2019person). This marks a significant improvement over previous body recognition models, each of which was often limited to one specific camera setup or image alignment and performed worse in in-the-wild scenarios. Additionally, we introduce a large-scale, cross-pose and cross-scale training and evaluation set designed to facilitate further research in this area. The name SapiensID refers to the ability to recognize humans.
  • Figure 2: Conventionally, face and body recognition have been handled independently, and body models have been trained on one specific dataset without the ability to generalize to other datasets. SapiensID is the first model to generalize across modalities, body poses, and camera settings.
  • Figure 3: Comparison between the standard grid patch scheme of Vision Transformers (ViT) and our Retina Patch. While maintaining the same or lower computational budget (number of tokens), Retina Patch dynamically allocates more patches to critical regions (e.g., face and upper torso) in an image. This allocation enhances the model's ability to capture fine-grained details in important regions and to handle varying scales more effectively than a fixed grid of patches.
  • Figure 4: Illustration of Retina Patch and position encoding computation. Top: three different ROIs generate patches at various scales (e.g., full image, upper torso, face). The corresponding position encodings are sampled from the same spatial locations as the patches, allowing ViT to infer spatial context and understand where each patch originated within the image. Bottom: the patches and position embeddings created by Retina Patch.
  • Figure 5: Illustration of the Masked Recognition Backbone with masking and the attention scaling trick for batched input during training. At test time, we pad with mask tokens so that all sequences have the same length.
  • ...and 9 more figures
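
The idea in the Figure 3 and Figure 4 captions (several ROIs, each cut into patches at its own scale, with a position recorded for every patch) can be illustrated with a small sketch. This is not the authors' implementation: the function names `roi_patches` and `tokenize`, the `(x0, y0, x1, y1)` ROI format, the 2x2 grid per ROI, the nearest-neighbour resampling, and the use of normalized patch centers as the place where a position encoding would be sampled are all assumptions made for illustration. ROIs are assumed to lie inside the image.

```python
from typing import List, Tuple

Image = List[List[float]]        # grayscale image stored as rows of pixel values
Box = Tuple[int, int, int, int]  # hypothetical ROI format: (x0, y0, x1, y1)


def roi_patches(image: Image, roi: Box, grid: int, patch: int):
    """Split one ROI into grid x grid cells and resample each to patch x patch.

    Returns (patches, centers). Each center is (x, y) in normalized image
    coordinates, i.e. the location a position encoding could be sampled from.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = roi
    cell_w = (x1 - x0) / grid
    cell_h = (y1 - y0) / grid
    patches, centers = [], []
    for gy in range(grid):
        for gx in range(grid):
            left = x0 + gx * cell_w
            top = y0 + gy * cell_h
            # Nearest-neighbour resampling of this cell to a fixed patch size,
            # so small ROIs (e.g. a face) are effectively magnified.
            pixels = []
            for py in range(patch):
                src_y = min(height - 1, int(top + (py + 0.5) * cell_h / patch))
                row = []
                for px in range(patch):
                    src_x = min(width - 1, int(left + (px + 0.5) * cell_w / patch))
                    row.append(image[src_y][src_x])
                pixels.append(row)
            patches.append(pixels)
            centers.append(((left + cell_w / 2) / width,
                            (top + cell_h / 2) / height))
    return patches, centers


def tokenize(image: Image, rois: List[Box], grid: int = 2, patch: int = 4):
    """Concatenate the patches of all ROIs into one token list."""
    patches, centers = [], []
    for roi in rois:
        p, c = roi_patches(image, roi, grid, patch)
        patches.extend(p)
        centers.extend(c)
    return patches, centers


# Toy 16x16 image and three nested ROIs: full image, upper region, face-like box.
image = [[float(y * 16 + x) for x in range(16)] for y in range(16)]
rois = [(0, 0, 16, 16), (4, 0, 12, 8), (6, 0, 10, 4)]
patches, centers = tokenize(image, rois)
print(len(patches), centers[0], centers[8])
```

Every ROI contributes the same number of equally sized patches in this sketch, so the smallest ROI is sampled most densely per pixel, which is the allocation effect the Figure 3 caption describes. The actual number of ROIs, the patch budget per ROI, and how position embeddings are interpolated are not specified in this section and would need to be taken from the paper.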