Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification
Md Rashidunnabi, Kailash A. Hambarde, Vasco Lopes, Joao C. Neves, Hugo Proenca
TL;DR
This work addresses cross-view video person re-identification in aerial-ground settings by proposing MTF-CVReID, a parameter-efficient framework that augments a frozen ViT-B/16 backbone with seven lightweight adapters to tackle view biases, scale disparities, and temporal misalignment. The seven modules (CSFN, MRFH, IAMM, TDM, IVFA, HTPL, and MVICL) normalize views, harmonize scales, reinforce identity with memory, model short-term and multi-scale temporal dynamics, align cross-view features, and enforce cross-view identity coherence, while maintaining real-time inference (~189 FPS). A two-stage training strategy preserves backbone stability and selectively tunes high-level components, achieving state-of-the-art results on AG-VPReID and strong generalization to G2A-VReID and MARS. The approach demonstrates that carefully designed adapters can substantially improve cross-view robustness and temporal consistency with minimal computational overhead, enabling practical deployment in heterogeneous surveillance contexts.
Abstract
Video-based person re-identification (ReID) in cross-view domains (for example, aerial-ground surveillance) remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies. To address these challenges, we propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. Specifically, we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment (IVFA) for perspective-invariant representation alignment; (6) Hierarchical Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities; and (7) Multi-View Identity Consistency Learning (MVICL) to enforce cross-view identity coherence using a contrastive learning paradigm. While adding only about 2 million parameters and 0.7 GFLOPs over the baseline, MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, with strong cross-dataset generalization to the G2A-VReID and MARS datasets. These results show that carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency. The source code is available at https://github.com/MdRashidunnabi/MTF-CVReID
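
To make the contrastive idea behind MVICL more concrete, the following is a minimal sketch of a cross-view contrastive objective. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says that MVICL uses "a contrastive learning paradigm", so the InfoNCE-style form, the one-directional aerial-to-ground matching, the function names `l2_normalize` and `cross_view_info_nce`, and the `temperature` default are all hypothetical choices made here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_view_info_nce(aerial, ground, temperature=0.1):
    """InfoNCE-style loss over paired embeddings (an assumed form, see lead-in).

    aerial[i] and ground[i] are embeddings of the same identity seen from two
    views. Every ground[j] with j != i serves as a negative for aerial[i].
    """
    if not aerial or len(aerial) != len(ground):
        raise ValueError("aerial and ground must be non-empty and equally long")

    a = [l2_normalize(v) for v in aerial]
    g = [l2_normalize(v) for v in ground]

    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the aerial anchor to every ground embedding.
        logits = [
            sum(x * y for x, y in zip(anchor, other)) / temperature
            for other in g
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the matching ground embedding.
        total += log_denom - logits[i]
    return total / len(a)


if __name__ == "__main__":
    aerial = [[1.0, 0.0], [0.0, 1.0]]
    ground_matched = [[0.9, 0.1], [0.1, 0.9]]
    ground_swapped = [[0.1, 0.9], [0.9, 0.1]]
    # Correctly paired views should yield the smaller loss.
    print(cross_view_info_nce(aerial, ground_matched))
    print(cross_view_info_nce(aerial, ground_swapped))
```

In this sketch the loss is small when each aerial embedding is closest to the ground embedding of the same identity and large when the pairing is wrong, which is the behavior a cross-view identity coherence term is meant to encourage. A symmetric variant would average this loss with the ground-to-aerial direction.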
