A Quantitative Evaluation of the Expressivity of BMI, Pose and Gender in Body Embeddings for Recognition and Identification

Basudha Pal, Siyuan Huang, Rama Chellappa

TL;DR

This work investigates how body attributes are encoded in vision-language-based person ReID representations by extending the notion of expressivity to the ReID domain and quantifying attribute information via Mutual Information Neural Estimation (MINE). The authors apply this MI-based expressivity framework to ViT-based ReID models (e.g., SemReID, PFD, DC-Former), feeding an augmented feature–attribute input to a neural estimator that measures $I_\theta(F, A)$ across layers and training epochs. Key findings show that BMI consistently exhibits the highest expressivity, especially in deeper layers, while yaw and pitch are more prominent in mid-layers and tend to be suppressed with training; gender remains moderately entangled but relatively stable. The work provides a principled, post-hoc explanation tool for attribute-driven correlations in ReID, with practical implications for fairness and robustness in open-set deployment, while acknowledging that MI-based estimates may be influenced by attribute entropy.

Abstract

Person Re-identification (ReID) systems that match individuals across images or video frames are essential in many real-world applications. However, existing methods are often influenced by attributes such as gender, pose, and body mass index (BMI), which vary in unconstrained settings and raise concerns related to fairness and generalization. To address this, we extend the notion of expressivity, defined as the mutual information between learned features and specific attributes, using a secondary neural network to quantify how strongly attributes are encoded. Applying this framework to three ReID models, we find that BMI consistently shows the highest expressivity in the final layers, indicating its dominant role in recognition. In the last attention layer, attributes are ranked as BMI > Pitch > Gender > Yaw, revealing their relative influences in representation learning. Expressivity values also evolve across layers and training epochs, reflecting a dynamic encoding of attributes. These findings demonstrate the central role of body attributes in ReID and establish a principled approach for uncovering attribute-driven correlations.
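
For reference, MINE (the estimator named in the TL;DR) is built on the Donsker-Varadhan lower bound on mutual information. The paper's own equations are not reproduced in this section, so the notation below is an assumption that follows the standard MINE formulation: $T_\theta$ denotes the secondary network applied to a feature–attribute pair, $P_{FA}$ the joint distribution, and $P_F \otimes P_A$ the product of the marginals.

$$
I(F; A) \;\geq\; I_\theta(F, A) \;=\; \sup_{\theta}\; \mathbb{E}_{P_{FA}}\!\left[ T_\theta(F, A) \right] \;-\; \log \mathbb{E}_{P_F \otimes P_A}\!\left[ e^{T_\theta(F, A)} \right]
$$

In practice, samples from $P_F \otimes P_A$ are typically obtained by shuffling the attribute values across the batch, which breaks the pairing with the features while preserving both marginals.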

Paper Structure

This paper contains 21 sections, 7 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Integrating the MINE block with the ViT-based SemReID [huang2023self] backbone to compute the expressivity of features with respect to attributes such as BMI, gender, pitch and yaw. Internally, the MINE block is a simple MLP with two hidden layers that computes the expressivity of the $m$-dimensional features $F$. Augmenting these features with an attribute vector $A$ extends the network input to $(m+1)$ dimensions. All subjects involved provided informed consent for their participation, including the use of their images in research publications and figures.
  • Figure 2: Attribute distribution and counts in the BRIAR dataset indicate sufficient variation across the attributes of interest.
  • Figure 3: Attribute annotated exemplar images from the BRIAR dataset. All subjects involved provided informed consent for their participation, including the use of their images in research publications and figures.
  • Figure 4: Expressivity of gender, yaw, pitch and BMI (attributes of the input image) in the features learned at each layer of SemReID.
  • Figure 5: Expressivity of gender, yaw, pitch and BMI (attributes of the input image) in the features learned at each training epoch of SemReID.
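
The MINE block described in the Figure 1 caption (an MLP with two hidden layers that takes the $m$-dimensional features augmented with one attribute value) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the NumPy backend, the plain gradient-ascent loop, the hidden width, the learning rate, the function names, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def init_critic(in_dim, hidden, rng):
    # Critic T_theta: MLP with two hidden layers and a scalar output.
    sizes = [in_dim, hidden, hidden, 1]
    return [[rng.normal(0.0, np.sqrt(2.0 / a), (a, b)), np.zeros(b)]
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # Returns T_theta(x) with shape (n,) and the per-layer activations.
    acts = [x]
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on the two hidden layers
        acts.append(x)
    return x[:, 0], acts

def backward(params, acts, grad_out):
    # Backpropagates d(objective)/dT (shape (n,)) to every weight and bias.
    grads, g = [], grad_out[:, None]
    for i in reversed(range(len(params))):
        if i < len(params) - 1:
            g = g * (acts[i + 1] > 0)  # ReLU derivative
        grads.append((acts[i].T @ g, g.sum(axis=0)))
        g = g @ params[i][0].T
    return grads[::-1]

def dv_bound(t_joint, t_marg):
    # Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].
    shift = t_marg.max()  # shift for a numerically stable log-mean-exp
    return t_joint.mean() - (shift + np.log(np.mean(np.exp(t_marg - shift))))

def estimate_expressivity(F, A, hidden=32, steps=500, lr=0.05, seed=0):
    # Hypothetical helper: trains the critic on (F, A) and returns the bound.
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    A = A.reshape(n, 1)
    params = init_critic(F.shape[1] + 1, hidden, rng)  # (m+1)-dim input
    joint = np.hstack([F, A])                          # samples of P(F, A)
    for _ in range(steps):
        marg = np.hstack([F, A[rng.permutation(n)]])   # samples of P(F)P(A)
        t, acts = forward(params, np.vstack([joint, marg]))
        w = np.exp(t[n:] - t[n:].max())
        # Gradient of the bound w.r.t. each critic output.
        grad_t = np.concatenate([np.full(n, 1.0 / n), -w / w.sum()])
        for p, (gW, gb) in zip(params, backward(params, acts, grad_t)):
            p[0] += lr * gW  # gradient ascent on the bound
            p[1] += lr * gb
    t_joint, _ = forward(params, joint)
    t_marg, _ = forward(params, np.hstack([F, A[rng.permutation(n)]]))
    return float(dv_bound(t_joint, t_marg))

# Synthetic stand-ins (assumptions): F plays the role of layer features.
rng = np.random.default_rng(1)
F = rng.normal(size=(2000, 4))
a_dep = F[:, 0] + 0.5 * rng.normal(size=2000)  # attribute encoded in F
a_ind = rng.normal(size=2000)                  # attribute unrelated to F
mi_dep = estimate_expressivity(F, a_dep)
mi_ind = estimate_expressivity(F, a_ind)
print(f"dependent: {mi_dep:.3f} nats, independent: {mi_ind:.3f} nats")
```

Running the estimator once per attribute and per layer (or per checkpoint) would give curves of the kind described in Figures 4 and 5. The value is a lower bound in nats, and it is evaluated here on the same samples used for training, so small positive values for an unrelated attribute should be read as estimator noise rather than encoded information.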