A Distractor-Aware Memory for Visual Object Tracking with SAM2

Jovana Videnovic, Alan Lukezic, Matej Kristan

TL;DR

The paper tackles distractor robustness in memory-based visual object tracking by introducing a distractor-aware memory (DAM) for SAM2 that splits memory into a Recent Appearance Memory (RAM) and a Distractor Resolving Memory (DRM). It proposes RAM/DRM-specific update protocols and an introspection-based DRM update triggered by SAM2 outputs, along with a distractor-distilled dataset (DiDi) to stress distractor handling. Without retraining, SAM2.1++ achieves state-of-the-art results across multiple benchmarks (VOT, bounding-box datasets) and on the new DiDi dataset, with notable gains in robustness and tracking quality and only modest speed trade-offs. The work highlights the importance of memory structure in tracking, suggesting that future gains may come from learnable memory management policies that further optimize the balance between appearance modeling and distractor suppression.

Abstract

Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames. While these methods already achieved top performance on many benchmarks, it was the recent release of SAM2 that brought memory-based trackers into the focus of the visual object tracking community. Nevertheless, modern trackers still struggle in the presence of distractors. We argue that a more sophisticated memory model is required, and propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly address segmentation accuracy and tracking robustness. The resulting tracker is denoted as SAM2.1++. We also propose a new distractor-distilled DiDi dataset to study the distractor problem better. SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state-of-the-art on six of them.
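
To make the two-part memory concrete, the following is a minimal Python sketch of the idea, not the authors' implementation. It assumes masks are given as sets of foreground pixel coordinates, that the recent-appearance part is a fixed-size FIFO updated on every frame, and that the distractor-resolving part is updated whenever the predicted mask diverges strongly from any alternative mask. The class name `DistractorAwareMemory`, the capacities, and the `divergence_threshold` value are all hypothetical; the paper's actual update rules are not reproduced here.

```python
from collections import deque


def iou(a, b):
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


class DistractorAwareMemory:
    """Toy two-part memory: recent appearance (RAM) plus distractor resolving (DRM).

    All capacities and the threshold are illustrative assumptions.
    """

    def __init__(self, ram_size=3, drm_size=3, divergence_threshold=0.5):
        self.ram = deque(maxlen=ram_size)  # most recent frames, oldest dropped first
        self.drm = deque(maxlen=drm_size)  # frames in which a distractor was resolved
        self.divergence_threshold = divergence_threshold

    def update(self, frame_id, predicted, alternatives):
        """Store a frame; return True if it was also added to the DRM."""
        # Assumption: RAM is refreshed on every call.
        self.ram.append(frame_id)
        # Introspection: how far is the chosen mask from the other hypotheses?
        divergence = max(
            (1.0 - iou(predicted, alt) for alt in alternatives), default=0.0
        )
        if divergence > self.divergence_threshold:
            self.drm.append(frame_id)
            return True
        return False


if __name__ == "__main__":
    memory = DistractorAwareMemory(ram_size=2, drm_size=2)
    target = frozenset({(0, 0), (0, 1)})
    distractor = frozenset({(5, 5), (5, 6)})
    print(memory.update(0, target, [target]))              # alternatives agree
    print(memory.update(1, target, [target, distractor]))  # one alternative diverges
    print(list(memory.ram), list(memory.drm))
```

Keeping the two buffers separate means that frames recording a resolved ambiguity are not evicted by the steady stream of recent frames, which is the motivation the abstract gives for a more structured memory.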

Paper Structure

This paper contains 21 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: SAM2.1++ distractor-aware memory (DAM) update is triggered by the divergence between the predicted and the alternative masks (top left). This resolves the visual ambiguity and increases tracking robustness (bottom). DAM leads to a significant performance boost, setting a new state of the art on VOT2022 (top right).
  • Figure 2: Overview of the SAM2 memory and the proposed Distractor-Aware Memory (DAM), which splits the model into Recent Appearance Memory (RAM) and Distractor Resolving Memory (DRM) and updates them by a new memory management protocol.
  • Figure 3: Example frames from the DiDi dataset showing challenging distractors. Targets are denoted by green bounding boxes.
  • Figure 4: Accuracy-robustness plot on DiDi for the ablated versions of SAM2.1++. The tracking quality is given at each label.
  • Figure 5: SAM2.1++ qualitative results on the DiDi dataset with predicted masks shown in green, and tracked objects denoted by arrows. Per-frame overlaps are shown above the figures to indicate failure-free tracking over the entire sequence.
  • ...and 2 more figures