Seg2Track-SAM2: SAM2-based Multi-object Tracking and Segmentation

Diogo Mendonça, Tiago Barros, Cristiano Premebida, Urbano J. Nunes

TL;DR

Seg2Track-SAM2 is a framework that integrates pretrained object detectors with SAM2 and a dedicated Seg2Track module to support track initialization, data association, and track refinement. The results indicate that it improves identity consistency and memory efficiency in MOTS without requiring dataset-specific training.

Abstract

Autonomous-driving perception systems require robust Multi-Object Tracking (MOT) to operate reliably in dynamic environments. MOT maintains consistent object identities across frames while preserving spatial accuracy. Recent foundation models, such as SAM2, provide promptable video segmentation without task-specific fine-tuning. However, their direct application to Multi-Object Tracking and Segmentation (MOTS) remains limited by the absence of explicit identity management mechanisms and by growing memory requirements during tracking. This work introduces Seg2Track-SAM2, a framework that integrates pretrained object detectors with SAM2 and a dedicated Seg2Track module to support track initialization, data association, and track refinement. The method operates without dataset-specific fine-tuning and remains detector-agnostic. Experimental evaluation on the KITTI MOTS and MOTS Challenge benchmarks shows that Seg2Track-SAM2 ranks fourth overall in both datasets while achieving the highest association accuracy (AssA) among compared methods. In addition, a sliding-window memory strategy reduces memory usage by up to 75% with minimal impact on tracking performance, enabling deployment under resource constraints. Together, these results indicate that Seg2Track-SAM2 improves identity consistency and memory efficiency in MOTS without requiring dataset-specific training. The code is available at https://github.com/hcmr-lab/Seg2Track-SAM2.
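
The abstract names track initialization and data association but does not state the matching criterion. The following is a minimal sketch under assumptions, not the authors' implementation: it assumes that detector boxes are matched to the boxes of existing tracks by greedy IoU matching, and that unmatched detections become candidates for new tracks. The names `box_iou` and `associate` and the `iou_threshold` value are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(track_boxes, detections, iou_threshold=0.5):
    """Greedy IoU matching (an assumed criterion, not taken from the paper).

    track_boxes: dict mapping track id -> box of the track's current mask
    detections:  list of detector boxes
    Returns (matches, unmatched), where matches maps track id -> detection
    index and unmatched lists detection indices that matched no track.
    """
    pairs = [
        (box_iou(tbox, det), tid, j)
        for tid, tbox in track_boxes.items()
        for j, det in enumerate(detections)
    ]
    # Visit candidate pairs from highest to lowest overlap.
    pairs.sort(key=lambda p: p[0], reverse=True)

    matches, used = {}, set()
    for iou, tid, j in pairs:
        if iou < iou_threshold:
            break  # every remaining pair overlaps even less
        if tid in matches or j in used:
            continue  # track or detection already assigned
        matches[tid] = j
        used.add(j)

    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched
```

In this sketch, a matched detection would reinforce its existing track, while each unmatched detection would be a candidate for initializing a new track.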

Paper Structure

This paper contains 20 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparative analysis of MOTS methods on the KITTI MOTS (top) and MOTS Challenge (bottom) benchmarks. The KITTI MOTS graph plots each method's average HOTA against its average AssA over the car and pedestrian classes. The data points are color-coded to indicate whether each approach requires training a detector, a tracker, both, or neither. The figure shows that Seg2Track-SAM2 achieves the highest AssA score and ranks in the top four for overall HOTA performance on both datasets. The proposed approach does not require fine-tuning either a detector or a tracker on the target datasets.
  • Figure 2: Illustration of the Seg2Track-SAM2 approach. The Seg2Track-SAM2 approach is a comprehensive system that combines an object detector, SAM2, and a track management module (Seg2Track) to perform robust MOTS. An input image is simultaneously analyzed by the object detector to generate bounding box proposals and by SAM2 to create segmentation masks. The Seg2Track module then uses these bounding boxes and previously generated masks to manage object tracks over time, initiating new tracks, reinforcing existing tracks, and removing stale ones through a series of association, filtering, and quality assessment processes.
  • Figure 3: Qualitative comparison of the Seg2Track-SAM2 approach against the SAM2 baseline on the MOTS task. The images illustrate the system's ability to manage object tracks in various scenarios. Blue bounding boxes indicate track initialization, marking a newly detected object. Red bounding boxes indicate track reinforcement, where a previously tracked object is correctly re-associated with its track.
  • Figure 4: Effect of varying the backward temporal window size ($T_w$) on Seg2Track-SAM2 performance. The y-axis reports the percentage difference with respect to the full-history baseline, which does not use the sliding state window. Across window sizes, HOTA, DetA, AssA, and LocA remain stable, with only minor fluctuations around the baseline. Memory usage, in contrast, decreases consistently as the window shrinks, with reductions surpassing 75% for the smaller window sizes compared to the full-history configuration.
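
The sliding state window described for Figure 4 can be illustrated with a short sketch. This is an assumption-laden illustration rather than the released code: it assumes the window keeps only the most recent $T_w$ per-frame states and discards older ones, and the names `SlidingStateWindow` and `percent_difference` are hypothetical. The helper mirrors the y-axis of Figure 4, which reports the percentage difference with respect to the full-history baseline.

```python
from collections import deque


class SlidingStateWindow:
    """Keeps only the most recent `window_size` per-frame states (hypothetical)."""

    def __init__(self, window_size):
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        # A deque with maxlen drops the oldest entry automatically when full.
        self._states = deque(maxlen=window_size)

    def add(self, frame_idx, state):
        """Store the state for one frame, evicting the oldest if needed."""
        self._states.append((frame_idx, state))

    def frames(self):
        """Frame indices currently held in the window, oldest first."""
        return [idx for idx, _ in self._states]

    def __len__(self):
        return len(self._states)


def percent_difference(value, baseline):
    """Percentage difference of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline


window = SlidingStateWindow(window_size=4)
for t in range(10):
    window.add(t, {"frame": t})  # placeholder state; a real state would hold memory features

print(window.frames())  # only the last four frames remain
```

Because the number of stored states is bounded by the window size instead of growing with the sequence length, memory use in this sketch stays constant once the window is full.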