One Graph to Track Them All: Dynamic GNNs for Single- and Multi-View Tracking

Martin Engilberge, Ivan Vrkic, Friedrich Wilke Grosche, Julien Pilet, Engin Turetken, Pascal Fua

TL;DR

The paper tackles online multi-people tracking under occlusion by unifying single- and multi-view scenarios within a dynamic spatiotemporal graph. A Unified Message Passing Network (UMPN) updates edge and vertex representations, assigns probabilities to potential connections, and extracts trajectories in an online fashion, optionally leveraging scene priors through camera vertices. The approach achieves state-of-the-art performance on WILDTRACK and MOT benchmarks and introduces SCOUT, a large-scale 25-view dataset with detailed scene reconstructions to better study occlusions and scene context. This framework advances practical surveillance and monitoring by enabling robust, end-to-end reasoning over time, views, and scene geometry while providing public dataset and code releases.

Abstract

This work presents a unified, fully differentiable model for multi-people tracking that learns to associate detections into trajectories without relying on pre-computed tracklets. The model builds a dynamic spatiotemporal graph that aggregates spatial, contextual, and temporal information, enabling seamless information propagation across entire sequences. To improve occlusion handling, the graph can also encode scene-specific information. We also introduce a new large-scale dataset with 25 partially overlapping views, detailed scene reconstructions, and extensive occlusions. Experiments show the model achieves state-of-the-art performance on public benchmarks and the new dataset, with flexibility across diverse conditions. Both the dataset and approach will be publicly released to advance research in multi-people tracking.

Paper Structure

This paper contains 54 sections, 18 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Multi-View Spatio-Temporal Graph. View Edges connect identical detections across views at the same timepoint. Temporal Edges connect detections in the same view at different timepoints, forming a fully connected graph (some connections omitted for clarity). Contextual Edges connect detections to nearby ones in the same view at the same timepoint. Camera vertices and edges can be added to model scene priors and occlusions.
  • Figure 2: Graph-based people tracking. Top. Our UMPN network updates both vertex and edge feature vectors at each time-step and generates classification scores for the edges and vertices. These scores are used to derive the final trajectories. Bottom left. Dynamic graph construction using a sliding time window. Bottom right. Camera edges can encode environmental occlusions.
  • Figure 3: Scene structure and camera placement. Overlapping cameras cover a 450 m path. Green triangles indicate camera positions and fields of view, while polygons mark visible ground areas.
  • Figure 4: Scene reconstruction. The top row displays frames captured by two distinct cameras and the bottom row presents their reconstructed textured scene meshes.
  • Figure 5: Graph construction ablation on MOT17. Context edges and longer temporal connections improve tracking performance, while lower detection confidence thresholds help capture more targets.
  • ...and 3 more figures
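
The three edge types named in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Detection` fields, the `build_edges` function, the use of ground-plane coordinates, and both radius thresholds are assumptions made for illustration. In particular, true identities are unknown at graph-construction time, so this sketch proposes view edges between detections that are close on the ground plane, which is only one plausible way to generate candidates.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record; field names are illustrative."""
    det_id: int
    view: int   # camera index
    t: int      # frame index inside the current time window
    x: float    # assumed ground-plane position
    y: float


def build_edges(dets, view_radius=0.5, context_radius=2.0):
    """Group detection pairs into view, temporal, and contextual edges.

    Both radii are assumed values, not taken from the paper.
    """
    edges = {"view": [], "temporal": [], "context": []}
    for a, b in combinations(dets, 2):
        dist = math.hypot(a.x - b.x, a.y - b.y)
        pair = (a.det_id, b.det_id)
        if a.t == b.t and a.view != b.view:
            # Same timepoint, different views: candidate for the same person.
            if dist <= view_radius:
                edges["view"].append(pair)
        elif a.t == b.t and a.view == b.view:
            # Same timepoint, same view: nearby detections provide context.
            if dist <= context_radius:
                edges["context"].append(pair)
        elif a.view == b.view:
            # Same view, different timepoints: fully connected in time.
            edges["temporal"].append(pair)
    return edges


dets = [
    Detection(0, view=0, t=0, x=0.0, y=0.0),
    Detection(1, view=1, t=0, x=0.1, y=0.0),
    Detection(2, view=0, t=0, x=1.0, y=0.0),
    Detection(3, view=0, t=1, x=0.2, y=0.0),
]
edges = build_edges(dets)
```

In a sliding-window setting such as the one shown in Figure 2, a function like this would be called on the detections inside the current window only, so the temporal edges stay bounded by the window length.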