Unconstrained Body Recognition at Altitude and Range: Comparing Four Approaches
Blake A Myers, Matthew Q Hill, Veda Nandan Gandi, Thomas M Metz, Alice J O'Toole
TL;DR
This work addresses long-term body identification under unconstrained conditions, where face cues may be unavailable or unreliable. It systematically compares two Vision Transformer models (BIDDS and Swin-BIDDS) with two ResNet-based models (LCRIM and NLCRIM), all trained on nearly two million images from nine datasets and evaluated on both standard re-ID benchmarks and the challenging BTS dataset, which was collected at a distance, at altitude, and with clothing changes. The results show that ViT-based models outperform ResNets, with Swin-BIDDS achieving the best overall performance, primarily due to larger input sizes and hierarchical attention; linguistic pre-training offers limited gains. The findings highlight the viability of transformer-based, body-shape–driven identification for real-world scenarios and point to future directions: leveraging larger, semi-supervised datasets and shape-focused encoders to further improve robustness.
Abstract
This study investigates four distinct approaches to long-term person identification using body shape. Unlike short-term re-identification systems that rely on temporary features (e.g., clothing), we focus on learning persistent body shape characteristics that remain stable over time. We introduce two body identification models: Body Identification from Diverse Datasets (BIDDS), based on a Vision Transformer (ViT), and Swin-BIDDS, based on a Swin-ViT. We also expand on previous approaches based on the Linguistic and Non-linguistic Core ResNet Identity Models (LCRIM and NLCRIM), with improved training. All models are trained on a large and diverse dataset of over 1.9 million images of approximately 5,000 identities across 9 databases. Performance is evaluated on standard re-identification benchmark datasets (MARS, MSMT17, Outdoor Gait, DeepChange) and on an unconstrained dataset that includes images captured at a distance (from close range to 1000 m), at altitude (from an unmanned aerial vehicle, UAV), and with clothing change. A comparative analysis across these models provides insights into how different backbone architectures and input image sizes affect long-term body identification performance under real-world conditions.
