Unconstrained Body Recognition at Altitude and Range: Comparing Four Approaches

Blake A Myers, Matthew Q Hill, Veda Nandan Gandi, Thomas M Metz, Alice J O'Toole

TL;DR

This work addresses long-term body identification under unconstrained conditions, where face cues may be unavailable or unreliable. It systematically compares two Vision Transformer models (BIDDS and Swin-BIDDS) with two ResNet-based models (LCRIM and NLCRIM), training on nearly two million images from nine datasets and evaluating on both standard re-ID benchmarks and the challenging BTS dataset, which was collected at a distance, at altitude, and with clothing changes. The results show that ViT-based models outperform ResNets, with Swin-BIDDS achieving the best overall performance, primarily due to larger input sizes and hierarchical attention; linguistic pre-training offers limited gains. The findings highlight the viability of transformer-based, body-shape–driven identification for real-world scenarios and point to future directions in leveraging larger, semi-supervised datasets and shape-focused encoders to further improve robustness.

Abstract

This study investigates four distinct approaches to long-term person identification using body shape. Unlike short-term re-identification systems that rely on temporary features (e.g., clothing), we focus on learning persistent body shape characteristics that remain stable over time. We introduce two body identification models: one based on a Vision Transformer (ViT), Body Identification from Diverse Datasets (BIDDS), and one based on a Swin-ViT (Swin-BIDDS). We also extend previous approaches based on the Linguistic and Non-linguistic Core ResNet Identity Models (LCRIM and NLCRIM) with improved training. All models were trained on a large and diverse dataset of over 1.9 million images of approximately 5k identities across 9 databases. Performance was evaluated on standard re-identification benchmark datasets (MARS, MSMT17, Outdoor Gait, DeepChange) and on an unconstrained dataset that includes images at a distance (from close range to 1000 m), at altitude (from an unmanned aerial vehicle, UAV), and with clothing change. A comparative analysis across these models provides insight into how backbone architecture and input image size affect long-term body identification performance under real-world conditions.

Paper Structure

This paper contains 25 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Example body images from the BTS dataset [cornett2023expanding]. Subject consented to publication.
  • Figure 2: Ablation: CMC and ROC curves for the architecture and image size comparisons show that image size is the critical factor in the superior performance of the Swin-BIDDS model over the BIDDS model.
  • Figure 3: Difference in performance by model architecture (ViT, Swin-ViT) and input image size (224 px$^2$, 384 px$^2$). Comparisons are shown for each of four datasets (DeepChange, MARS, MSMT17, and Outdoor Gait) on four identification metrics (retrieval at ranks 1 and 20, true accept rate at false accept rates $10^{-3}$ and $10^{-4}$). Architecture Difference (blue) is defined as the difference between Swin-BIDDS(224,224) and BIDDS(224,224). Image Size Difference (orange) is defined as the difference between Swin-BIDDS(384,384) and Swin-BIDDS(224,224).
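
The metrics named in the Figure 3 caption can be illustrated with a minimal sketch. It assumes a probe-by-gallery similarity matrix and integer identity labels. The function names (`rank_k_accuracy`, `tar_at_far`, `metric_difference`) are hypothetical and are not taken from the paper, and the paper's actual evaluation protocol may differ, for example in how the threshold is chosen or how ties are handled.

```python
import numpy as np

def rank_k_accuracy(sim, probe_ids, gallery_ids, k):
    """Fraction of probes whose identity appears among the k most similar gallery items."""
    # Indices of the k highest-similarity gallery items for each probe.
    order = np.argsort(-sim, axis=1)[:, :k]
    topk_ids = gallery_ids[order]  # shape: (n_probes, k)
    hits = (topk_ids == probe_ids[:, None]).any(axis=1)
    return float(hits.mean())

def tar_at_far(sim, probe_ids, gallery_ids, far):
    """True accept rate at the threshold set by the (1 - far) quantile of impostor scores."""
    match = probe_ids[:, None] == gallery_ids[None, :]
    genuine, impostor = sim[match], sim[~match]
    # Assumption: the threshold is an empirical quantile of the impostor scores.
    threshold = np.quantile(impostor, 1.0 - far)
    return float((genuine > threshold).mean())

def metric_difference(metric_a, metric_b):
    """Difference a - b, e.g. Swin-BIDDS(224,224) minus BIDDS(224,224)."""
    return metric_a - metric_b
```

Under these assumptions, the Architecture Difference would be `metric_difference(m_swin_224, m_bidds_224)` and the Image Size Difference would be `metric_difference(m_swin_384, m_swin_224)`, where each `m_*` is one of the four metrics computed for that model on a given dataset.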