Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking

Mattia Segu, Luigi Piccinelli, Siyuan Li, Yung-Hsu Yang, Bernt Schiele, Luc Van Gool

TL;DR

This work introduces Samba, a novel linear-time set-of-sequences model designed to jointly process multiple tracklets by synchronizing the multiple selective state-spaces used to model each tracklet, and proposes SambaMOTR, the first tracker effectively addressing long-range dependencies, tracklet interdependencies, and temporal occlusions.

Abstract

Multiple object tracking in complex scenarios - such as coordinated dance performances, team sports, or dynamic animal groups - presents unique challenges. In these settings, objects frequently move in coordinated patterns, occlude each other, and exhibit long-term dependencies in their trajectories. However, how to model long-range dependencies within tracklets, interdependencies among tracklets, and the associated temporal occlusions remains a key open research question. To this end, we introduce Samba, a novel linear-time set-of-sequences model designed to jointly process multiple tracklets by synchronizing the multiple selective state-spaces used to model each tracklet. Samba autoregressively predicts the future track query for each sequence while maintaining synchronized long-term memory representations across tracklets. By integrating Samba into a tracking-by-propagation framework, we propose SambaMOTR, the first tracker to effectively address these issues: long-range dependencies, tracklet interdependencies, and temporal occlusions. Additionally, we introduce an effective technique for dealing with uncertain observations (MaskObs) and an efficient training recipe to scale SambaMOTR to longer sequences. By modeling long-range dependencies and interactions among tracked objects, SambaMOTR implicitly learns to track objects accurately through occlusions without any hand-crafted heuristics. Our approach significantly surpasses the prior state of the art on the DanceTrack, BFT, and SportsMOT datasets.

Paper Structure (37 sections, 4 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Tracking multiple objects in challenging scenarios - such as coordinated dance performances (a), dynamic animal groups (b), and team sports (c) - requires handling complex interactions, occlusions, and fast movements. As shown in the tracklets above, objects may move in coordinated patterns and occlude each other. By leveraging the joint long-range dependencies in their trajectories, SambaMOTR accurately tracks objects through time and occlusions.
  • Figure 2: Overview of SambaMOTR. SambaMOTR combines a transformer-based object detector with a set-of-sequences Samba model. The object detector's encoder extracts image features from each frame, which are fed into its decoder together with detect and track queries to detect newborn objects or re-detect tracked ones. The Samba set-of-sequences model is composed of multiple synchronized Samba units that simultaneously process the past memory and currently observed output queries for all tracklets to predict the next track queries and update the track memory. The hidden states of newborn objects are initialized from zero values (barred squares). In case of occlusions or uncertain detections, the corresponding query is masked (red cross) during the Samba update.
  • Figure 3: Synchronized State-Space Models. We illustrate a set of $k$ synchronized state-space models (SSMs). A Long-Term Memory Update block updates each hidden state $\tilde{h}_{t-1}^i$ based on the current observation $x_t^i$, resulting in the updated memory $h_t^i$. The Memory Synchronization block then derives the synchronized hidden state $\tilde{h}_t^i$, which is fed into the Output Update module to predict the output $y_t^i$.
  • Figure A: Illustration of our Set-of-sequences Model block. Our set-of-sequences model Samba simultaneously processes an arbitrary number $M$ of input sequences. Each sequence is processed by a Samba unit, synchronized with the others thanks to our synchronized state-space model. All Samba units share weights and are composed of a stack of $N$ Samba blocks. A Samba block has the same architecture as a Mamba block, but it adopts our synchronized SSM to synchronize long-term memory representations across the individual state-space models.
  • Figure B: Schematic illustration of our contributions (as ablated in \ref{tab:method_components}). State-space model (SSM) blocks at timesteps with gradient applied are in green, and blocks without gradient are in grey.
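
The three-stage flow described in the Figure 3 caption (Long-Term Memory Update, Memory Synchronization, Output Update) can be sketched as a toy recurrence in plain Python. This is only an illustration under stated assumptions, not the paper's implementation. The captions do not say how the Memory Synchronization block combines hidden states, so the sketch assumes a simple blend of each state with the mean over all sequences. The scalar coefficients `decay`, `gain`, `readout`, and `mix`, and all function names, are hypothetical placeholders for the learned, input-dependent parameters of a selective state-space model.

```python
from typing import List, Tuple

Vector = List[float]


def memory_update(h_sync_prev: Vector, x: Vector,
                  decay: float = 0.5, gain: float = 0.5) -> Vector:
    """Long-Term Memory Update: h_t = decay * h~_{t-1} + gain * x_t.

    ASSUMPTION: fixed scalar coefficients stand in for learned parameters.
    """
    return [decay * h + gain * xi for h, xi in zip(h_sync_prev, x)]


def synchronize(hidden: List[Vector], mix: float = 0.5) -> List[Vector]:
    """Memory Synchronization: derive h~_t^i from all h_t^1..h_t^k.

    ASSUMPTION: each state is blended with the mean over sequences;
    the actual synchronization mechanism is not given in the captions.
    """
    k, dim = len(hidden), len(hidden[0])
    mean = [sum(h[d] for h in hidden) / k for d in range(dim)]
    return [[(1.0 - mix) * h[d] + mix * mean[d] for d in range(dim)]
            for h in hidden]


def output_update(h_sync: Vector, readout: float = 1.0) -> Vector:
    """Output Update: y_t = readout * h~_t (hypothetical scalar readout)."""
    return [readout * v for v in h_sync]


def synchronized_step(h_sync_prev: List[Vector], xs: List[Vector],
                      mix: float = 0.5) -> Tuple[List[Vector], List[Vector]]:
    """One timestep for k sequences: update, synchronize, then read out."""
    hidden = [memory_update(h, x) for h, x in zip(h_sync_prev, xs)]
    h_sync = synchronize(hidden, mix=mix)
    ys = [output_update(h) for h in h_sync]
    return h_sync, ys


# Two newborn tracklets: hidden states start from zero, as in Figure 2.
initial_states = [[0.0], [0.0]]
observations = [[1.0], [3.0]]
states, outputs = synchronized_step(initial_states, observations)
```

Setting `mix=0.0` in this sketch recovers `k` independent recurrences, which makes the role of the synchronization step easy to isolate.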