
Towards Zero-Shot Interpretable Human Recognition: A 2D-3D Registration Framework

Henrique Jesus, Hugo Proença

TL;DR

The paper introduces a zero-shot, interpretable biometric recognition framework that learns solely from synthetic data by aligning 2D images with 3D person prototypes through a 2D-3D registration process. It learns semantic correspondences by partitioning 3D bodies into 14 parts and optimizing a cosine-based cross-modal objective, enabling human-understandable explanations of decisions. A synthetic data generation pipeline using SMPL-based meshes, head details, VPoser poses, and Blender rendering supports unlimited variation, and the model demonstrates domain generalization to real images with interpretable region correspondences, though clothing and hairstyle variations pose challenges. The approach offers a pathway for legally and forensically robust biometrics by providing semantically grounded explanations of recognition results, with future work aimed at incorporating 3D clothing and richer textual justification.
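To make the idea of a cosine-based cross-modal objective more concrete, the following is a minimal sketch, not the authors' loss. It assumes that every image feature and every point-cloud feature carries a body-part label (one of the 14 parts), that features sharing a label should have cosine similarity close to 1, and that features with different labels should stay below a margin. The function names, the `margin` value, and the exact form of the two terms are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero


def cross_modal_loss(img_feats, img_parts, pc_feats, pc_parts, margin=0.2):
    """Illustrative pull/push objective over image and point-cloud features.

    Assumption (not from the paper): pairs with the same part label are
    pulled together via (1 - cos), pairs with different labels are pushed
    apart via a hinge on (cos - margin).
    """
    pull, push = [], []
    for f_img, part_img in zip(img_feats, img_parts):
        for f_pc, part_pc in zip(pc_feats, pc_parts):
            sim = cosine(f_img, f_pc)
            if part_img == part_pc:
                pull.append(1.0 - sim)              # same part: raise similarity
            else:
                push.append(max(0.0, sim - margin))  # other part: lower similarity
    pull_term = sum(pull) / len(pull) if pull else 0.0
    push_term = sum(push) / len(push) if push else 0.0
    return pull_term + push_term


# Toy usage: two 2D features per modality, labeled with parts 0 and 1.
img_feats = [[1.0, 0.0], [0.0, 1.0]]
pc_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = cross_modal_loss(img_feats, [0, 1], pc_feats, [0, 1])
misaligned = cross_modal_loss(img_feats, [0, 1], pc_feats, [1, 0])
```

In this toy case the aligned labeling yields a loss near zero, while the swapped labeling is penalized by both terms. A real implementation would operate on batched tensors and on learned per-pixel and per-point features.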

Abstract

Large vision models based on deep learning architectures have consistently advanced the state of the art in biometric recognition. However, three weaknesses are commonly reported for such approaches: 1) their extreme demands in terms of training data; 2) the difficulty of generalizing across domains; and 3) the lack of interpretability/explainability, which is of particular concern in biometrics, where evidence must be usable for forensic/legal purposes (e.g., in courts). To the best of our knowledge, this paper describes the first recognition framework/strategy that aims to address the three weaknesses simultaneously. First, it relies exclusively on synthetic samples for learning. Instead of requiring a large number and variety of samples for each subject, the idea is to enroll only a single 3D point cloud per identity. Then, using generative strategies, we synthesize a very large (potentially infinite) number of samples containing all the desired covariates (poses, clothing, distances, perspectives, lighting, occlusions, ...). Depending on the synthesis method used, it is possible to adapt precisely to different kinds of domains, which supports generalization. Such data are then used to learn a model that performs local registration between image pairs, establishing positive correspondences between body parts that are key not only to recognition (according to their cardinality and distribution), but also to providing an interpretable description of the response (e.g., "both samples are from the same person, as they have similar facial shape, hair color and leg thickness").

Paper Structure (16 sections, 2 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: We propose an interpretable human recognition framework trained exclusively on synthetic data. In contrast to traditional methods that rely on datasets with a limited variety of clothing, poses, and perspectives (e.g., market), our pipeline generates data with considerable variability. Our model learns by transferring knowledge from the subject in the image to a 3D representation of the same subject. In the end, it can perform recognition on real data while also providing human-understandable explanations for its decisions through registration.
  • Figure 2: Overview of the whole pipeline proposed in this paper. In the initial phase, we compile highly detailed 3D meshes of each individual and use them to generate images covering a wide range of factors. In the subsequent phase, the model learns to semantically match each individual's images with the corresponding 3D representation (point cloud) by pulling similar features together and pushing dissimilar features apart.
  • Figure 3: Illustration of the proposed network architecture: the network divides into two branches, one responsible for the image and the other for the point cloud, with feature sharing between them. The image branch has two heads, one responsible for the detector map and the other for the feature map. In the point cloud branch, the features from the image are concatenated, and in the end, the point cloud feature map is returned.
  • Figure 4: Examples of the results attained by our model. The colored regions represent the semantic matching between the individual in the image and the respective point cloud part.
  • Figure 5: Similarity matrices between all individuals for different values of the threshold parameter $t$. The Y-axis represents the point clouds and the X-axis the images. The scale indicates the correspondence rate $\rho$.
  • ...and 4 more figures
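
The caption of Figure 5 refers to a correspondence rate $\rho$ that depends on a threshold $t$, but the captions listed here do not define it. As a hedged illustration only, the sketch below assumes that $\rho$ is the fraction of image features whose best cosine match among the point-cloud features reaches at least $t$. This definition, the function names, and the toy features are assumptions, not the paper's formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero


def correspondence_rate(img_feats, pc_feats, t):
    """Assumed definition: share of image features with a match >= t.

    For each image feature, take its most similar point-cloud feature and
    count it as a correspondence when that similarity reaches threshold t.
    """
    if not img_feats or not pc_feats:
        return 0.0
    matched = 0
    for f_img in img_feats:
        best = max(cosine(f_img, f_pc) for f_pc in pc_feats)
        if best >= t:
            matched += 1
    return matched / len(img_feats)


# Toy usage: three image features compared against one enrolled feature.
img_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pc_feats = [[1.0, 0.0]]
rho_loose = correspondence_rate(img_feats, pc_feats, t=0.5)
rho_strict = correspondence_rate(img_feats, pc_feats, t=0.9)
```

Under this assumed definition, raising $t$ can only keep or lower $\rho$, which is consistent with a figure that shows how the image-versus-point-cloud matrix changes as the threshold varies. Evaluating the function for every image and point-cloud pair would produce one matrix of that kind per value of $t$.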