SoDA: Multi-Object Tracking with Soft Data Association

Wei-Chih Hung, Henrik Kretzschmar, Tsung-Yi Lin, Yuning Chai, Ruichi Yu, Ming-Hsuan Yang, Dragomir Anguelov

TL;DR

This paper tackles robust multi-object tracking in cluttered autonomous-driving scenes by replacing hard data associations with soft, attention-based data aggregation. By introducing attention measurement encoding, it produces track embeddings that encode spatiotemporal context across a temporal window, while an explicit occlusion state enables reasoning about objects that are temporarily hidden. The approach demonstrates improved MOTA and IDF1 on Waymo, KITTI, and MOT17 benchmarks, with ablations confirming the benefits of both the encoding and occlusion mechanisms. The work also shows scalability to large-scale datasets and highlights potential offline advantages through future-context integration, suggesting practical impact for real-time tracking in self-driving systems. Overall, the method advances MOT by learning rich context without rigid associations, improving robustness to occlusions and detector noise.

Abstract

Robust multi-object tracking (MOT) is a prerequisite for a safe deployment of self-driving cars. Tracking objects, however, remains a highly challenging problem, especially in cluttered autonomous driving scenes in which objects tend to interact with each other in complex ways and frequently get occluded. We propose a novel approach to MOT that uses attention to compute track embeddings that encode the spatiotemporal dependencies between observed objects. This attention measurement encoding allows our model to relax hard data associations, which may lead to unrecoverable errors. Instead, our model aggregates information from all object detections via soft data associations. The resulting latent space representation allows our model to learn to reason about occlusions in a holistic data-driven way and maintain track estimates for objects even when they are occluded. Our experimental results on the Waymo Open Dataset suggest that our approach leverages modern large-scale datasets and performs favorably compared to the state of the art in visual multi-object tracking.
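
As a rough illustration of the soft aggregation the abstract describes, the sketch below passes a set of detection features through a few single-head self-attention layers, so that every detection mixes in information from all others without committing to a hard assignment. It is not the authors' implementation: the feature dimension, the random projection matrices, the residual connection, and the names `self_attention_layer` and `softmax` are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_layer(z, w_q, w_k, w_v):
    """One single-head self-attention layer over detection features.

    z: (num_detections, d) array of per-detection features.
    Returns the updated features and the (num_detections, num_detections)
    attention weights, which act as soft association values.
    """
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # scaled dot-product similarity
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return z + weights @ v, weights           # residual update (an assumption)

# Hypothetical setup: 5 detections with 8-dimensional features, 2 layers.
rng = np.random.default_rng(0)
d = 8
z0 = rng.normal(size=(5, d))
params = [tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
          for _ in range(2)]

z = z0
for w_q, w_k, w_v in params:
    z, weights = self_attention_layer(z, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over all detections, which is the sense in which the association is soft: no detection is ever discarded by an early, irreversible matching decision.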

Paper Structure

This paper contains 24 sections, 5 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overview. We propose a novel approach to multi-object tracking. Given object detections based on the measurements $z^0$, our model encodes spatiotemporal context information for each measurement with $N$ self-attention layers, resulting in features $z^N$ that learn from soft association values, which do not rely on hard-associated tracks. Based on the aggregated features, the model then predicts a probability distribution for each track that captures soft data associations and a latent state $z_{\text{occ}}$, which indicates that the track is occluded.
  • Figure 2: Comparison of encoding methods. We illustrate how tracking methods use context and history. The red and orange circles represent detections associated with tracks. The white circles represent incoming detections that are yet to be associated. (a) Consider how similar an incoming detection is to the latest detection associated with each track. (b) Aggregate information from all detections that are associated with each track via hard data association. (c) Share information between tracks. (d) Aggregate information from all detections to leverage the spatiotemporal context without committing to any hard data associations.
  • Figure 3: Attention association with explicit occlusion reasoning. Our method explicitly reasons about occlusions by attending to a separate occlusion state. The orange circles refer to associated detections, the white circles refer to incoming detections, and the gray circles refer to occlusion states. The model classifies a track as occluded if the track embedding most strongly attends to the occlusion state and maintains the state embedding for future association.
  • Figure 4: Learning curves. The learning curves suggest that our approach benefits the most from modern large-scale datasets, such as the Waymo Open Dataset.
  • Figure 5: An example showcasing the explicit occlusion reasoning. We show an occlusion scenario as handled by our model with and without explicit occlusion reasoning. Top: The baseline model without explicit occlusion reasoning. Bottom: Our method with attention measurement encoding and explicit occlusion reasoning. The car on the left side is occluded between $t=3$ and $t=7$. At $t=3$, our method attends to the occlusion state with a value of $0.43$, maintains the track throughout the occlusion, and then recovers the same track at $t=7$.
  • ...and 2 more figures
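
The occlusion reasoning in the Figure 3 caption can be sketched as a track embedding attending over the incoming detections plus one extra occlusion slot, with the track classified as occluded when the occlusion slot receives the largest weight. This is a minimal sketch under stated assumptions, not the paper's model: the dot-product scoring, the hand-picked vectors, and the function name `associate_with_occlusion` are hypothetical.

```python
import numpy as np

def associate_with_occlusion(track, detections, occ_state):
    """Soft association of one track over detections plus an occlusion slot.

    track: (d,) track embedding.
    detections: (n, d) embeddings of incoming detections.
    occ_state: (d,) embedding of the occlusion state (hypothetical).
    Returns (probs, is_occluded), where probs has n + 1 entries and the
    last entry is the weight placed on the occlusion state.
    """
    candidates = np.vstack([detections, occ_state[None, :]])
    scores = candidates @ track / np.sqrt(track.shape[0])
    scores = scores - scores.max()            # stabilize the softmax
    probs = np.exp(scores) / np.exp(scores).sum()
    is_occluded = int(np.argmax(probs)) == len(detections)
    return probs, is_occluded

track = np.array([1.0, 0.0, 0.0, 0.0])
occ_state = np.array([2.0, 0.0, 0.0, 0.0])

# No incoming detection resembles the track: the occlusion slot wins.
hidden = np.array([[0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
probs_hidden, occluded_hidden = associate_with_occlusion(track, hidden, occ_state)

# A strongly matching detection reappears: the track attends to it instead.
visible = np.array([[3.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
probs_visible, occluded_visible = associate_with_occlusion(track, visible, occ_state)
```

In the first call the occlusion slot gets the largest weight, so the track would be kept alive as occluded; in the second, the first detection dominates and the track is associated with it.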