Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification

Md Rashidunnabi, Kailash A. Hambarde, Vasco Lopes, Joao C. Neves, Hugo Proenca

TL;DR

This work addresses cross-view video person re-identification in aerial-ground settings by proposing MTF-CVReID, a parameter-efficient framework that augments a frozen ViT-B/16 backbone with seven lightweight adapters to tackle view biases, scale disparities, and temporal misalignment. The seven modules (CSFN, MRFH, IAMM, TDM, IVFA, HTPL, and MVICL) normalize views, harmonize scales, reinforce identity with memory, model short-term and multi-scale temporal dynamics, align cross-view features, and enforce cross-view identity coherence, while maintaining real-time inference (~189 FPS). A two-stage training strategy preserves backbone stability and selectively tunes high-level components, achieving state-of-the-art results on AG-VPReID and strong generalization to G2A-VReID and MARS. The approach demonstrates that carefully designed adapters can substantially improve cross-view robustness and temporal consistency with minimal computational overhead (about 2 million parameters and 0.7 GFLOPs over the baseline), enabling practical deployment in heterogeneous surveillance contexts.

Abstract

Video-based person re-identification (ReID) in cross-view domains (for example, aerial-ground surveillance) remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies. To address these challenges, we propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. Specifically, we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment (IVFA) for perspective-invariant representation alignment; (6) Hierarchical Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities; and (7) Multi-View Identity Consistency Learning (MVICL) that enforces cross-view identity coherence using a contrastive learning paradigm. Despite adding only about 2 million parameters and 0.7 GFLOPs over the baseline, MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, with strong cross-dataset generalization to G2A-VReID and MARS datasets. These results show that carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency. The source code is available at https://github.com/MdRashidunnabi/MTF-CVReID
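
The abstract states that MVICL enforces cross-view identity coherence with a contrastive objective, but it does not give the exact loss. As a rough illustration only, the sketch below uses a standard one-directional InfoNCE loss between aerial and ground clip embeddings. The function names, the pairing convention (`aerial[i]` and `ground[i]` share an identity), and the temperature of 0.07 are assumptions made for this example, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_view_info_nce(aerial, ground, temperature=0.07):
    """Hypothetical aerial-to-ground InfoNCE loss.

    aerial[i] and ground[i] are assumed to be clip embeddings of the same
    identity seen from two views; every other ground[j] acts as a negative.
    """
    a = [l2_normalize(v) for v in aerial]
    g = [l2_normalize(v) for v in ground]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of aerial clip i to every ground clip, scaled.
        logits = [sum(x * y for x, y in zip(ai, gj)) / temperature for gj in g]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching ground clip as the target.
        total += log_denom - logits[i]
    return total / len(a)


if __name__ == "__main__":
    aerial = [[1.0, 0.0], [0.0, 1.0]]
    print(cross_view_info_nce(aerial, [[1.0, 0.0], [0.0, 1.0]]))  # views agree
    print(cross_view_info_nce(aerial, [[0.0, 1.0], [1.0, 0.0]]))  # views swapped
```

The loss is small when each aerial embedding is closest to the ground embedding of the same identity and large when the pairing is broken. A symmetric variant would average this term with the ground-to-aerial direction.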

Paper Structure

This paper contains 20 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 2: Overall architecture of the proposed MTF-CVReID framework. The pipeline starts from TF-CLIP ViT-B/16 frame tokens and integrates seven key modules (CSFN, MRFH, IAMM, TDM, IVFA, HTPL, and MVICL) that collectively enhance cross-view robustness, temporal coherence, and scale adaptation. These components extend the core design of VM-TAPS to produce perspective-invariant and temporally stable embeddings with minimal computational overhead, balancing accuracy and efficiency.
  • Figure 3: Schema of CLIP-Memory: CSFN $\rightarrow$ MRFH $\rightarrow$ IAMM. Left: CSFN applies a view-conditioned residual normalization to CLIP tokens to neutralize aerial/ground/wearable biases; middle: MRFH forms three "virtual scales" and fuses them with content-adaptive weights to stabilize person size across altitudes; right: IAMM implements a View-Aware Memory Bank, where for each identity–view pair $(n,v)$, $S$ prototypes are stored in $\mathbf{M}_{n,v}$. During training, a clip descriptor $\mathbf{f}^{(b)}$ attends to its slice $\mathbf{M}_{y^{(b)},v^{(b)}}$ to obtain a context $\mathbf{c}^{(b)}$, which is then gated and fused back into the representation, reinforcing stable identity traits across time and viewpoints. At inference, retrieval is class-agnostic and based on feature similarity.
  • Figure 4: Temporal–Memory Diffusion across Views. Left: TDM computes frame differences ($\Delta$) and gates motion with appearance to form motion-aware tokens; in parallel, HTPL aggregates multi-scale temporal streams ($s=1,2,4,8$) and fuses them ($\oplus$) to provide longer-range context. Center: TMC (frame-to-memory token condensation) summarizes per-frame tokens into compact clip memories that are ready for cross-view exchange. Right: IVFA performs cross-view information exchange (CVIM) by attending to complementary-view prototypes and then diffuses the retrieved context back into tokens for view-aligned embeddings; the MVICL head (dashed, loss only) enforces cross-view identity consistency and back-propagates to the alignment path, improving A2G/G2A matching without affecting inference.
  • Figure 5: Top-5 cross-view retrieval under three stressors: (a) tiny high-altitude targets (80–120 m), (b) large aerial$\leftrightarrow$ground viewpoint gaps, and (c) look-alike clothing. In each row the leftmost image is the query, followed by the TF-CLIP and MTF-CVReID ranked lists (green = correct, red = incorrect). The shown cases are representative of the hardest AG-VPReID conditions. Our modules target these failure modes explicitly: CSFN and MRFH stabilize view and scale so that matching relies on shape and proportion rather than noisy textures; IAMM reinforces persistent identity cues (e.g., backpack outline, shoe contrast) across frames; TDM and HTPL contribute short- and long-horizon motion patterns (gait, cadence); IVFA with MVICL pulls aerial and ground embeddings into a shared identity space. Together these effects flip near-misses into Rank-1 wins and produce cleaner shortlists, in line with the quantitative gains reported in the efficiency trade-off and ablation tables.
  • Figure 6: t-SNE of clip embeddings. Each dot is a clip and each color an identity; panels show TF-CLIP (left) versus MTF-CVReID (right) for A2G (top) and G2A (bottom), all with identical t-SNE settings. Our method yields tighter within-identity clusters and larger inter-identity gaps: the silhouette score improves from 0.6362 to 0.7025 (A2G) and from 0.7338 to 0.8045 (G2A), indicating more view-invariant, discriminative embeddings that explain the higher retrieval scores.
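
The Figure 4 caption describes two temporal paths: TDM, which takes frame differences and gates motion with appearance, and HTPL, which aggregates temporal streams at strides 1, 2, 4, and 8 and fuses them. A minimal sketch of both ideas follows, assuming per-frame features are plain lists of floats. The specific gate (a sigmoid of the appearance feature), the residual form, the per-stream mean pooling, and the averaging used for fusion are all assumptions chosen for illustration; the captions do not specify them, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def motion_aware_tokens(frames):
    """TDM-style sketch: add gated frame differences to appearance features.

    frames is a list of T per-frame feature vectors. The gate used here
    (sigmoid of the appearance value) is an assumed stand-in for the
    learned gating described in the caption.
    """
    out = []
    for t, cur in enumerate(frames):
        prev = frames[t - 1] if t > 0 else cur  # first frame: zero motion
        delta = [c - p for c, p in zip(cur, prev)]
        out.append([c + sigmoid(c) * d for c, d in zip(cur, delta)])
    return out


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def multi_scale_temporal_pool(frames, strides=(1, 2, 4, 8)):
    """HTPL-style sketch: one temporal stream per stride, then fuse.

    Each stream keeps every s-th frame and mean-pools it; the streams are
    fused by a plain average (an assumed substitute for the fusion step).
    """
    streams = [mean_vectors(frames[::s]) for s in strides]
    return mean_vectors(streams)


if __name__ == "__main__":
    clip = [[float(t)] for t in range(8)]  # one feature per frame, values 0..7
    print(motion_aware_tokens(clip)[:2])
    print(multi_scale_temporal_pool(clip))
```

With an 8-frame clip, stride 1 sees every frame while stride 8 sees only the first, so the fused descriptor mixes fine and coarse temporal context. In a trained model the gate and the fusion weights would be learned rather than fixed as they are here.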