The Early Bird Identifies the Worm: You Can't Beat a Head Start in Long-Term Body Re-ID (ECHO-BID)
Thomas M. Metz, Matthew Q. Hill, Alice J. O'Toole
TL;DR
The paper investigates whether fine-tuning vision foundation models directly on unconstrained clothes-change data can outperform traditional model-based approaches to long-term re-identification. It introduces a simple, data-efficient UPD transfer protocol and demonstrates that four foundation models (CLIP, DINOv2, AIMv2, EVA-02) achieve state-of-the-art results across constrained and unconstrained clothes-change re-id tasks, with EVA-02-based ECHO-BID delivering the strongest head start. Ablation studies indicate that backbone size and the choice of transfer protocol are critical factors, while pretraining scale beyond ImageNet-21k offers limited gains. Fusing the four fine-tuned foundation models improves performance further, suggesting that foundation-model diversity can push long-term re-id beyond prior, more complex methods while using only modest amounts of domain data.
Abstract
A wide range of model-based approaches to long-term person re-identification have been proposed. Whether these models perform more accurately than direct domain transfer learning applied to extensively trained large-scale foundation models is not known. We applied domain transfer learning for long-term person re-id to four vision foundation models (CLIP, DINOv2, AIMv2, and EVA-02). Domain-adapted versions of all four models (CLIP-L, DINOv2-L, AIMv2-L, and EVA-02-L) surpassed existing state-of-the-art models by a large margin in highly unconstrained viewing environments. Decision score fusion of the four models improved performance over any individual model. Of the individual models, the EVA-02 foundation model provided the best ``head start'' to long-term re-id, surpassing other models on three of the four performance metrics by substantial margins. Accordingly, we introduce $\textbf{E}$va $\textbf{C}$lothes-Change from $\textbf{H}$idden $\textbf{O}$bjects - $\textbf{B}$ody $\textbf{ID}$entification (ECHO-BID), a class of long-term re-id models built on object-pretrained EVA-02 Large backbones. Ablation experiments varying backbone size, scale of object classification pretraining, and transfer learning protocol indicated that model size and the use of a smaller but more challenging transfer learning protocol are critical to performance. We conclude that foundation models provide a head start to domain transfer learning and support state-of-the-art performance with modest amounts of domain data. Because long-term re-id data are scarce, this approach is advantageous.
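The decision score fusion mentioned above can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so this example assumes one common choice: z-normalize each model's probe-by-gallery similarity matrix and average the results. The function names (`z_normalize`, `fuse_scores`, `rank1_matches`) and the toy scores are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Standardize one model's probe-by-gallery scores to zero mean and unit variance."""
    flat = [s for row in scores for s in row]
    mu, sigma = mean(flat), pstdev(flat)
    if sigma == 0:
        # All scores identical: the model carries no ranking information.
        return [[0.0 for _ in row] for row in scores]
    return [[(s - mu) / sigma for s in row] for row in scores]


def fuse_scores(score_matrices):
    """Average z-normalized score matrices (an assumed fusion rule, not the paper's)."""
    normalized = [z_normalize(m) for m in score_matrices]
    n_rows, n_cols = len(normalized[0]), len(normalized[0][0])
    return [
        [mean(m[i][j] for m in normalized) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def rank1_matches(scores):
    """Return, for each probe (row), the index of the highest-scoring gallery item."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy similarity scores: 2 probes x 3 gallery identities, from two hypothetical models.
model_a = [[0.9, 0.2, 0.1], [0.3, 0.4, 0.35]]
model_b = [[0.8, 0.1, 0.3], [0.2, 0.1, 0.9]]

fused = fuse_scores([model_a, model_b])
print(rank1_matches(model_a))  # rank-1 picks from model A alone
print(rank1_matches(fused))    # rank-1 picks after fusion
```

In this toy case the second probe is weakly separated by model A but strongly separated by model B, so the fused scores follow model B for that probe. Normalizing before averaging keeps a model with a wider raw score range from dominating the fused decision.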
