Dissecting Human Body Representations in Deep Networks Trained for Person Identification
Thomas M Metz, Matthew Q Hill, Blake Myers, Veda Nandan Gandi, Rahul Chilakapati, Alice J O'Toole
TL;DR
This work probes what deep body-identification networks encode beyond identity, including residual face cues, gender, viewpoint, and image-origin signals, by analyzing four backbones trained on 1.9 million images across 9 datasets. Using face-obscured and face-only tests, linear readouts for gender and viewpoint, and PCA-based subspace editing, the authors show that facial information aids body identification and that nonidentity attributes persist in the embeddings. They also show that editing the embedding space can improve retrieval without retraining: across architectures, simple subspace techniques, including selective deletion of early principal components, boost Rank-1, TAR@FAR $10^{-3}$, and mAP across datasets. The persistence of nonidentity information also raises potential privacy and security concerns for biometric systems. Together, these results point to opportunities for semantic editing of embeddings and offer practical methods for improving long-term body re-identification performance with no additional training.
Abstract
Long-term body identification algorithms have emerged recently with the increased availability of high-quality training data. We seek to fill knowledge gaps about these models by analyzing body image embeddings from four body identification networks trained with 1.9 million images across 4,788 identities and 9 databases. By analyzing a diverse range of architectures (ViT, SWIN-ViT, CNN, and linguistically primed CNN), we first show that the face contributes to the accuracy of body identification algorithms and that these algorithms can identify faces to some extent, despite having no explicit face training. Second, we show that representations (embeddings) generated by body identification algorithms encode information about gender, as well as image-based information including view (yaw) and even the dataset from which the image originated. Third, we demonstrate that identification accuracy can be improved without additional training by operating directly and selectively on the learned embedding space. Using principal component analysis (PCA), we found that identity comparisons were consistently more accurate in subspaces that excluded the dimensions explaining large amounts of variance. These three findings were surprisingly consistent across architectures and test datasets. This work represents the first analysis of body representations produced by long-term re-identification networks trained on challenging unconstrained datasets.
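
The third finding can be illustrated with a minimal sketch: fit PCA on a set of embeddings, discard the leading (highest-variance) components, and compare identities by cosine similarity in the remaining subspace. This is not the authors' implementation. The function names (`fit_pca`, `drop_leading_components`, `rank1_accuracy`) and the parameter `n_drop` are hypothetical, and the number of components to remove is an assumption that would have to be tuned per model and dataset.

```python
import numpy as np


def fit_pca(embeddings):
    """Return the mean and the principal axes of a set of embeddings.

    The axes are the rows of the returned matrix, ordered from the
    largest to the smallest explained variance.
    """
    mean = embeddings.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt


def drop_leading_components(embeddings, mean, axes, n_drop):
    """Project onto the principal axes and discard the first n_drop of them.

    n_drop is a hypothetical tuning parameter, not a value from the paper.
    """
    coords = (embeddings - mean) @ axes.T
    return coords[:, n_drop:]


def rank1_accuracy(probe, probe_ids, gallery, gallery_ids):
    """Fraction of probes whose most similar gallery item shares their identity."""
    p = probe / np.linalg.norm(probe, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    nearest = np.argmax(p @ g.T, axis=1)  # cosine-similarity nearest neighbor
    return float(np.mean(gallery_ids[nearest] == probe_ids))
```

In use, one would fit the PCA on a held-out set of embeddings, apply `drop_leading_components` to both probe and gallery embeddings with the same `mean` and `axes`, and compare `rank1_accuracy` before and after the projection.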
