
Object Re-identification via Spatial-temporal Fusion Networks and Causal Identity Matching

Hye-Geun Kim, Yong-Hyuk Moon, Yeong-Jun Cho

TL;DR

Object re-identification in large camera networks faces severe appearance ambiguities and real-world streaming constraints. The authors propose FusionNet, a spatial-temporal fusion network that learns a final similarity $S_F$ by combining appearance similarity $S_A$ with a spatial-temporal similarity $S_T$ derived from camera transitions. $S_T$ is defined over a time window using the transition-time distribution $p_{ij}(\tau)$ between cameras $i$ and $j$, and the camera network topology is estimated with an adaptive Parzen window. Complementing this, Causal Identity Matching (CIM) dynamically builds galleries and merges queries along adjacent cameras using a camera adjacency matrix and transition-time distributions, enabling ID-to-ID ReID in real-world settings. The study introduces a new Vehicle-3I dataset and demonstrates that adaptive topology estimation, FusionNet, Top-$k$ multi-shot matching, and CIM yield state-of-the-art or competitive performance on vehicle and person ReID across VeRi776, Vehicle-3I, and Market-1501, supporting the framework's practical applicability in real-world surveillance systems.

Abstract

Object re-identification (ReID) in large camera networks faces numerous challenges. First, the similar appearances of objects degrade ReID performance, a challenge that existing appearance-based ReID methods still need to address. Second, most ReID studies are performed in laboratory settings and do not consider real-world scenarios. To overcome these challenges, we introduce a novel ReID framework that leverages a spatial-temporal fusion network and causal identity matching (CIM). Our framework estimates camera network topology using a proposed adaptive Parzen window and combines appearance features with spatial-temporal cues within the fusion network. This approach demonstrates outstanding performance across several datasets, including VeRi776, Vehicle-3I, and Market-1501, achieving up to 99.70% rank-1 accuracy and 95.5% mAP. Furthermore, the proposed CIM approach, which dynamically assigns gallery sets based on camera network topology, further improves ReID accuracy and robustness in real-world settings, as evidenced by a 94.95% mAP and a 95.19% F1 score on the Vehicle-3I dataset. The experimental results support the effectiveness of incorporating spatial-temporal information and CIM for real-world ReID scenarios, regardless of the data domain (e.g., vehicle, person).
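
To make the spatial-temporal cue more concrete, the following is a minimal sketch of how a transition-time histogram $h_{ij}$ could be smoothed into a distribution $p_{ij}$ with a Parzen window and then scored over a time window. It is an illustration under stated assumptions, not the authors' method: the Gaussian kernel uses a fixed bandwidth (the paper's window is adaptive, and its adaptation rule is not given here), and the function names `parzen_smooth` and `window_score`, the bin layout, and the window sum are all hypothetical choices.

```python
import numpy as np


def parzen_smooth(hist, bandwidth):
    """Smooth a transition-time histogram with a Gaussian Parzen window.

    hist: 1-D array of counts per time bin (h_ij for one camera pair).
    bandwidth: kernel standard deviation in bins. ASSUMPTION: fixed here,
        whereas the paper describes an adaptive window.
    Returns a distribution over the same bins that sums to 1
    (or all zeros if the histogram is empty).
    """
    hist = np.asarray(hist, dtype=float)
    bins = np.arange(hist.size)
    # Pairwise bin offsets give the Gaussian kernel weights (n_bins x n_bins).
    diff = bins[:, None] - bins[None, :]
    kernel = np.exp(-0.5 * (diff / bandwidth) ** 2)
    density = kernel @ hist
    total = density.sum()
    return density / total if total > 0 else density


def window_score(p, tau_bin, half_width):
    """Sum p over a window of bins centred on the observed time gap.

    ASSUMPTION: a plain window sum stands in for the paper's S_T definition.
    """
    lo = max(0, tau_bin - half_width)
    hi = min(p.size, tau_bin + half_width + 1)
    return float(p[lo:hi].sum())


# Toy histogram of observed transition times between two cameras.
h_ij = np.array([0, 1, 5, 9, 4, 1, 0, 0])
p_ij = parzen_smooth(h_ij, bandwidth=1.0)

# Score a candidate pair whose time gap falls into bin 3.
s_t = window_score(p_ij, tau_bin=3, half_width=1)
```

A score of this kind would then be one input, alongside the appearance similarity $S_A$, to a learned fusion model; the fusion network itself is not sketched here.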

Paper Structure (23 sections, 16 equations, 12 figures, 7 tables)

This paper contains 23 sections, 16 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Challenges of object re-identification in large-scale camera networks due to appearance ambiguities and computational complexities.
  • Figure 2: Overview of the proposed ReID framework for object re-identification with spatial-temporal information in real-world scenarios.
  • Figure 3: Examples of estimated transition time distributions between camera pairs. Each bin covers a range of 100 frames. Solid blue lines (---) mark the distribution ($p_{ij}$) estimated from the histogram ($h_{ij}$) using the proposed adaptive Parzen window (best viewed in color).
  • Figure 4: Comparison of ReID methodologies. Circles and boxes denote cameras and appearances, respectively. The number inside each box represents the appearance ID. Boxes filled with olive color indicate true positives for the query. Boxes with dotted lines indicate appearances excluded from the gallery. (a) One-to-all ReID, which uses only a single appearance for the query, performs many redundant and duplicated comparisons, even comparing different objects within the same camera. (b) ID-to-ID ReID uses multiple appearances and considers causality to determine the gallery.
  • Figure 5: Feature distributions of objects. Example image patches reflect their actual sizes.
  • ...and 7 more figures
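
The gallery restriction described for ID-to-ID ReID in Figure 4(b) can be illustrated with a short sketch. This is a simplified reading, not the paper's CIM algorithm: it assumes each identity is summarized as a tracklet with a camera index and start/end frames, that adjacency is given as a 0/1 matrix, and that a hypothetical per-pair `max_gap` (in frames) bounds plausible transition times. The names `Tracklet` and `causal_gallery` are introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracklet:
    track_id: int
    cam: int
    t_start: int  # first frame in which the object is seen
    t_end: int    # last frame in which the object is seen


def causal_gallery(query, tracklets, adjacency, max_gap):
    """Select gallery tracklets that are causally plausible for the query.

    A candidate is kept only if it (1) is in a different camera,
    (2) is in a camera adjacent to the query's camera, and
    (3) appears after the query has left, within max_gap frames.
    ASSUMPTION: max_gap is a hand-set bound standing in for a
    transition-time distribution.
    """
    gallery = []
    for g in tracklets:
        if g.cam == query.cam:
            continue  # skip appearances within the same camera
        if not adjacency[query.cam][g.cam]:
            continue  # skip cameras that are not adjacent
        gap = g.t_start - query.t_end
        if 0 < gap <= max_gap[query.cam][g.cam]:
            gallery.append(g)
    return gallery


# Three cameras in a chain: 0 <-> 1 <-> 2 (0 and 2 are not adjacent).
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
max_gap = [[1000] * 3 for _ in range(3)]

query = Tracklet(track_id=1, cam=0, t_start=0, t_end=100)
candidates = [
    Tracklet(2, cam=1, t_start=250, t_end=400),    # adjacent, plausible gap
    Tracklet(3, cam=1, t_start=50, t_end=90),      # seen before the query left
    Tracklet(4, cam=2, t_start=300, t_end=350),    # non-adjacent camera
    Tracklet(5, cam=0, t_start=200, t_end=260),    # same camera as the query
    Tracklet(6, cam=1, t_start=5000, t_end=5100),  # gap too large
]

gallery = causal_gallery(query, candidates, adjacency, max_gap)
gallery_ids = [g.track_id for g in gallery]
```

With this toy data only tracklet 2 survives the filter, which shows how topology and causality can shrink the gallery before any appearance comparison is made.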