SoDA: Multi-Object Tracking with Soft Data Association
Wei-Chih Hung, Henrik Kretzschmar, Tsung-Yi Lin, Yuning Chai, Ruichi Yu, Ming-Hsuan Yang, Dragomir Anguelov
TL;DR
This paper tackles robust multi-object tracking in cluttered autonomous-driving scenes by replacing hard data associations with soft, attention-based data aggregation. By introducing attention measurement encoding, it produces track embeddings that encode spatiotemporal context across a temporal window, while an explicit occlusion state enables reasoning about objects that are temporarily hidden. The approach demonstrates improved MOTA and IDF1 on Waymo, KITTI, and MOT17 benchmarks, with ablations confirming the benefits of both the encoding and occlusion mechanisms. The work also shows scalability to large-scale datasets and highlights potential offline advantages through future-context integration, suggesting practical impact for real-time tracking in self-driving systems. Overall, the method advances MOT by learning rich context without rigid associations, improving robustness to occlusions and detector noise.
Abstract
Robust multi-object tracking (MOT) is a prerequisite for a safe deployment of self-driving cars. Tracking objects, however, remains a highly challenging problem, especially in cluttered autonomous driving scenes in which objects tend to interact with each other in complex ways and frequently get occluded. We propose a novel approach to MOT that uses attention to compute track embeddings that encode the spatiotemporal dependencies between observed objects. This attention measurement encoding allows our model to relax hard data associations, which may lead to unrecoverable errors. Instead, our model aggregates information from all object detections via soft data associations. The resulting latent space representation allows our model to learn to reason about occlusions in a holistic data-driven way and maintain track estimates for objects even when they are occluded. Our experimental results on the Waymo Open Dataset suggest that our approach leverages modern large-scale datasets and performs favorably compared to the state of the art in visual multi-object tracking.
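
The contrast between hard and soft data association can be illustrated with a minimal sketch. It is not the authors' model: it uses plain scaled dot-product attention with no learned projections, and the names `soft_associate`, `hard_associate`, `query`, and `detections` are hypothetical, chosen only for this illustration. A hard association commits to a single detection, whereas a soft association weights every detection and aggregates their features into one embedding.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity(query, detections):
    """Scaled dot product between one query vector and each detection feature."""
    scale = math.sqrt(len(query))
    return [sum(q * d for q, d in zip(query, det)) / scale for det in detections]


def hard_associate(query, detections):
    """Hard association: commit to the index of the best-scoring detection."""
    scores = similarity(query, detections)
    return max(range(len(scores)), key=scores.__getitem__)


def soft_associate(query, detections):
    """Soft association: attention weights over all detections, plus the
    weighted average of their features (an illustrative track embedding)."""
    weights = softmax(similarity(query, detections))
    dim = len(detections[0])
    embedding = [
        sum(w * det[i] for w, det in zip(weights, detections)) for i in range(dim)
    ]
    return weights, embedding


# Toy data (assumed for illustration): one track query, three detection features.
query = [1.0, 0.0]
detections = [[0.9, 0.1], [0.8, 0.3], [-1.0, 0.5]]

best_index = hard_associate(query, detections)
weights, embedding = soft_associate(query, detections)
print("hard choice:", best_index)
print("soft weights:", [round(w, 3) for w in weights])
print("aggregated embedding:", [round(x, 3) for x in embedding])
```

In this toy case the two nearby detections both receive substantial weight, so a wrong choice between them is never locked in; a hard matcher would have to pick one and discard the other.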
