
Scale-Aware Vision-Language Adaptation for Extreme Far-Distance Video Person Re-identification

Ashwat Rajbhandari, Bharatesh Chakravarthi

Abstract

Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To address noisy and low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate aerial-ground domain shifts, and further refine retrieval using improved optimization and k-reciprocal re-ranking. Experiments on the DetReIDX stress-test benchmark show that our approach achieves mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A), corresponding to an overall mAP of 35.73. These results demonstrate that large-scale vision-language backbones, when combined with stability-focused adaptation, substantially enhance robustness in extreme far-distance video person ReID.
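
As a concrete illustration of the temporal attention pooling mentioned above, the sketch below (not the authors' implementation; the module name, hidden width, and feature dimension are assumptions) scores each frame embedding of a tracklet and forms a weighted average, so that blurred or low-resolution frames contribute less to the clip-level descriptor.

```python
# Minimal sketch of temporal attention pooling over a tracklet of per-frame
# CLIP embeddings. Dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        # Lightweight scorer: one hidden layer producing a scalar score per frame.
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 4),
            nn.Tanh(),
            nn.Linear(feat_dim // 4, 1),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) per-frame embeddings.
        scores = self.scorer(frame_feats)               # (batch, num_frames, 1)
        weights = torch.softmax(scores, dim=1)          # normalize over time
        clip_feat = (weights * frame_feats).sum(dim=1)  # weighted average -> (batch, feat_dim)
        return clip_feat

# Usage on a dummy batch of two 8-frame tracklets:
pooling = TemporalAttentionPooling(feat_dim=768)
tracklet = torch.randn(2, 8, 768)
video_descriptor = pooling(tracklet)  # (2, 768)
```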

Figures (3)

  • Figure 1: Illustration of extreme far-distance conditions in the DetReIDX dataset. As UAV altitude and horizontal distance increase, pedestrians undergo severe scale compression and resolution degradation. The red bounding boxes highlight the same group of individuals as their pixel footprint shrinks dramatically, showing the visual challenges of aerial-ground video person ReID.
  • Figure 2: Overview of the proposed CLIP ViT-L/14-based video ReID framework. Augmented tracklets are encoded by a CLIP backbone enhanced with PBP prompt conditioning and adapter modules, with only the final two transformer blocks unfrozen during fine-tuning (a minimal sketch of this selective freezing scheme follows the figure list). Frame-level features are aggregated via temporal attention pooling and optimized using identity and triplet losses.
  • Figure 3: mAP comparison across baseline methods, the best public result, and our method on the DetReIDX benchmark.
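
Figure 2 notes that only the final two transformer blocks of the backbone are unfrozen during fine-tuning. The sketch below shows one way such backbone-aware selective fine-tuning could be realized in PyTorch; the toy block stack, helper name, and block layout are assumptions rather than the authors' code.

```python
# Hedged sketch: freeze a large visual transformer and unfreeze only its last
# few blocks, mirroring the selective fine-tuning in Figure 2.
import torch.nn as nn

def selective_finetune(backbone: nn.Module, blocks: nn.ModuleList,
                       num_trainable_blocks: int = 2) -> None:
    # Freeze every backbone parameter first.
    for p in backbone.parameters():
        p.requires_grad = False
    # Re-enable gradients only for the final `num_trainable_blocks` blocks.
    for block in blocks[-num_trainable_blocks:]:
        for p in block.parameters():
            p.requires_grad = True

# Example with a toy stack of 24 blocks standing in for a ViT-L/14 encoder:
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=1024, nhead=16) for _ in range(24)
)
backbone = nn.Sequential(*blocks)
selective_finetune(backbone, blocks, num_trainable_blocks=2)
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
```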