SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation

Junjie Jiang, Zelin Wang, Manqi Zhao, Yin Li, DongSheng Jiang

TL;DR

This work argues that segmentation should be the core of multi-object tracking rather than an auxiliary cue tied to detection. It introduces SAM2MOT, a unified zero-shot MOT system that integrates a pre-trained detector and a pre-trained segmentor with dedicated modules for cross-object interaction and trajectory management. Across DanceTrack, UAVDT, and BDD100K, SAM2MOT achieves state-of-the-art results, notably improving identity association and robustness under occlusions without fine-tuning. By reducing dependence on labeled tracking data and enabling high-quality pre-annotations, it offers a pathway toward scalable data collection and more generalizable MOT solutions.

Abstract

Inspired by Segment Anything 2, which generalizes segmentation from images to videos, we propose SAM2MOT, a novel segmentation-driven paradigm for multi-object tracking that breaks away from the conventional detection-association framework. In contrast to previous approaches that treat segmentation as auxiliary information, SAM2MOT places it at the heart of the tracking process, systematically tackling challenges like false positives and occlusions. Its effectiveness has been thoroughly validated on major MOT benchmarks. Furthermore, SAM2MOT integrates a pre-trained detector and a pre-trained segmentor with tracking logic into a zero-shot MOT system that requires no fine-tuning. This significantly reduces dependence on labeled data and paves the way for transitioning MOT research from task-specific solutions to general-purpose systems. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Notably, SAM2MOT outperforms existing methods on DanceTrack by +2.1 HOTA and +4.5 IDF1, highlighting its effectiveness in MOT. Code is available at https://github.com/TripleJoy/SAM2MOT.

Paper Structure

This paper contains 20 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: IDF1-HOTA-AssA comparisons of different trackers on the test set of DanceTrack, where the horizontal axis represents HOTA, the vertical axis represents IDF1, and the circle radius indicates AssA. This comparison highlights our method's superior capability in associating objects across frames, surpassing all previous trackers.
  • Figure 2: Overview of our Tracking-by-Segmentation MOT framework, including: (a) the overall architecture of SAM2MOT; (b) analysis of the relationship between detector, tracker, and data dependency.
  • Figure 3: Cross-object interaction pipeline. During motion, when the occlusion between objects exceeds a predefined threshold (0.8), we identify identity-confused objects by analyzing their logits scores and the corresponding variance. The memory entries of such objects in the current frame are then excluded from being written into the memory bank to prevent the propagation of incorrect information.
  • Figure 4: Sample tracking results visualization of ByteTrack and SAM2MOT using the same detector on DanceTrack, BDD100K-MOT and UAVDT-MOT. The results indicate that SAM2MOT significantly outperforms ByteTrack in association performance under scenarios involving camera motion, detector degradation, and occlusion.
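
The cross-object interaction step described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the captions give only the 0.8 occlusion threshold and say that logits scores and their variance are analyzed. Everything else here is an assumption made for illustration, including the function names, the representation of masks as sets of pixel coordinates, the definition of occlusion as the fraction of the smaller mask covered by the other, and the rule that the object with the lower mean logits score (ties broken by higher variance) is the identity-confused one.

```python
from statistics import mean, pvariance

# Occlusion threshold quoted in the Figure 3 caption.
OCCLUSION_THRESHOLD = 0.8


def overlap_ratio(mask_a, mask_b):
    """Fraction of the smaller mask covered by the other mask.

    Masks are sets of (row, col) pixels. This occlusion measure is an
    assumption; the captions do not define how occlusion is computed.
    """
    smaller = min(len(mask_a), len(mask_b))
    if smaller == 0:
        return 0.0
    return len(mask_a & mask_b) / smaller


def confused_id(id_a, scores_a, id_b, scores_b):
    """Pick the likely identity-confused object of an occluding pair.

    Hypothetical rule: the lower mean logits score loses; on equal means,
    the object with the higher score variance loses.
    """
    key_a = (mean(scores_a), -pvariance(scores_a))
    key_b = (mean(scores_b), -pvariance(scores_b))
    return id_a if key_a < key_b else id_b


def ids_to_skip_memory_write(masks, score_history):
    """Return object ids whose current-frame memory should not be stored.

    masks: {object_id: set of (row, col) pixels} for the current frame.
    score_history: {object_id: list of recent logits scores}.
    """
    skip = set()
    ids = sorted(masks)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if overlap_ratio(masks[a], masks[b]) > OCCLUSION_THRESHOLD:
                skip.add(confused_id(a, score_history[a], b, score_history[b]))
    return skip
```

In a tracker loop, the returned ids would be the objects whose current-frame entries are withheld from the memory bank, while all other objects are written as usual.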