Spatial-Temporal Multi-Cuts for Online Multiple-Camera Vehicle Tracking

Fabian Herzog, Johannes Gilg, Philipp Wolters, Torben Teepe, Gerhard Rigoll

TL;DR

This work introduces a graph representation that allows spatial-temporal clustering in a single, combined step: new detections are spatially and temporally connected with existing clusters, and clusters are compared based on the strongest available evidence.

Abstract

Accurate online multiple-camera vehicle tracking is essential for intelligent transportation systems, autonomous driving, and smart city applications. Like single-camera multiple-object tracking, it is commonly formulated as a graph problem of tracking-by-detection. Within this framework, existing online methods usually consist of two-stage procedures that cluster temporally first, then spatially, or vice versa. This is computationally expensive and prone to error accumulation. We introduce a graph representation that allows spatial-temporal clustering in a single, combined step: new detections are spatially and temporally connected with existing clusters. By keeping sparse appearance and positional cues of all detections in a cluster, our method can compare clusters based on the strongest available evidence. The final tracks are obtained online using a simple multicut assignment procedure. Our method does not require any training on the target scene, pre-extraction of single-camera tracks, or additional annotations. Notably, we outperform the online state-of-the-art in terms of IDF1 by more than 14% on the CityFlow dataset and by more than 25% on the Synthehicle dataset. The code is publicly available.

Paper Structure

This paper contains 22 sections, 12 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: The proposed spatial-temporal multicuts (STMC). (1) Given the current video frames of all cameras, we apply a detector and extract the appearance features. (2) Features and detections are embedded in a common feature space. (3) Next, the measurements are aggregated into a superbox-structure and connected to existing tracks. (4) Weights are computed on the basis of appearance and distance similarity. (5) The multicut solver then yields cluster candidates, and detections and tracks within clusters are assigned to each other in an assignment procedure.
  • Figure 2: Matching thresholds for ReID and positions and their importance. We perform a grid search on the training subsets of Synthehicle (a) and CityFlow (b) to find the optimal parameters for weight scaling. For $(\theta_{\text{feat}}, \theta_{\text{pos}})$ we choose $(0.8, 4.0)$ for Synthehicle and $(0.7, 0.001)$ for CityFlow. The relative importance of the thresholds is depicted in panel (c); we set $\lambda$ (cf. \ref{eq:scale}) to $0.4$ for Synthehicle and to $0.9$ for CityFlow. The ground plane is calibrated in meters for Synthehicle and in GPS coordinates for CityFlow.
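
Steps (4) and (5) of Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions: the weight-scaling equation is not reproduced in this section, so `edge_weight` uses an assumed form in which $\theta_{\text{feat}}$ is a cosine-similarity threshold, $\theta_{\text{pos}}$ is a ground-plane distance threshold, and $\lambda$ blends the two cues. The solver is also a stand-in: `greedy_multicut` is a greedy additive edge contraction heuristic, chosen here only because it is short, and the section does not state which multicut solver the paper uses. Both function names are hypothetical. The default parameter values are the Synthehicle settings from Figure 2.

```python
import numpy as np


def edge_weight(feat_a, feat_b, pos_a, pos_b,
                theta_feat=0.8, theta_pos=4.0, lam=0.4):
    """Signed edge weight between two measurements (ASSUMED form).

    Positive values favor "same vehicle", negative values favor a cut.
    Feature vectors are assumed to be non-zero.
    """
    feat_a, feat_b = np.asarray(feat_a, float), np.asarray(feat_b, float)
    cos_sim = feat_a @ feat_b / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b))
    dist = np.linalg.norm(np.asarray(pos_a, float) - np.asarray(pos_b, float))
    w_feat = cos_sim - theta_feat            # > 0 if appearance is similar enough
    w_pos = (theta_pos - dist) / theta_pos   # > 0 if closer than theta_pos
    return float(lam * w_feat + (1.0 - lam) * w_pos)


def greedy_multicut(num_nodes, weights):
    """Greedy additive edge contraction, a heuristic for the multicut problem.

    weights maps node pairs (i, j) to signed weights. The pair with the
    largest positive weight is merged repeatedly; weights of parallel edges
    created by a merge are summed. Merging stops once no positive edge is left.
    """
    members = {i: [i] for i in range(num_nodes)}
    w = {(min(i, j), max(i, j)): v for (i, j), v in weights.items()}
    while w:
        (a, b), best = max(w.items(), key=lambda kv: kv[1])
        if best <= 0:
            break
        members[a].extend(members.pop(b))    # contract cluster b into cluster a
        merged = {}
        for (i, j), v in w.items():
            if (i, j) == (a, b):
                continue                     # the contracted edge disappears
            i, j = (a if i == b else i), (a if j == b else j)
            key = (min(i, j), max(i, j))
            merged[key] = merged.get(key, 0.0) + v
        w = merged
    return sorted(sorted(m) for m in members.values())


if __name__ == "__main__":
    # Toy graph: nodes 0 and 1 agree strongly, node 2 conflicts with node 1.
    toy = {(0, 1): 0.9, (1, 2): -0.5, (0, 2): 0.2, (2, 3): 0.7}
    print(greedy_multicut(4, toy))
```

In the toy graph, node 2 has a weak positive edge to node 0 but a stronger negative edge to node 1. Once nodes 0 and 1 are merged, the summed evidence toward node 2 is negative, so node 2 ends up in a cluster with node 3 instead. Summing weights after contraction is one simple way to let accumulated cluster evidence override a single weak pairwise cue.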