Matching Anything by Segmenting Anything

Siyuan Li, Lei Ke, Martin Danelljan, Luigi Piccinelli, Mattia Segu, Luc Van Gool, Fisher Yu

TL;DR

This work introduces MASA, a universal, SAM-driven framework that learns cross-domain, instance-level object association from unlabeled images to enable zero-shot tracking of any detected objects. By generating dense instance proposals with SAM and training a contrastive, instance-aware representation via two-view augmentations, MASA achieves strong cross-domain generalization without video annotations. The MASA adapter extends frozen segmentation/detection backbones, distilling SAM's localization priors and learning discriminative embeddings, and can be used to detect, segment, and track everything in a unified pipeline. Across MOT/MOTS benchmarks and open-world tasks, MASA delivers state-of-the-art or competitive zero-shot performance, highlighting its potential for robust, domain-agnostic tracking in real-world scenarios.

Abstract

The robust association of the same objects across video frames in complex scenes is crucial for many applications, especially Multiple Object Tracking (MOT). Current methods predominantly rely on labeled domain-specific video datasets, which limits the cross-domain generalization of learned similarity embeddings. We propose MASA, a novel method for robust instance association learning, capable of matching any objects within videos across diverse domains without tracking labels. Leveraging the rich object segmentation from the Segment Anything Model (SAM), MASA learns instance-level correspondence through exhaustive data transformations. We treat the SAM outputs as dense object region proposals and learn to match those regions from a vast image collection. We further design a universal MASA adapter which can work in tandem with foundational segmentation or detection models and enable them to track any detected objects. Those combinations present strong zero-shot tracking ability in complex domains. Extensive tests on multiple challenging MOT and MOTS benchmarks indicate that the proposed method, using only unlabeled static images, achieves even better performance than state-of-the-art methods trained with fully annotated in-domain video sequences, in zero-shot association. Project Page: https://matchinganything.github.io/
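The instance similarity learning described above can be illustrated with a generic contrastive objective. The following is a minimal sketch, assuming an InfoNCE-style loss over cosine similarities; it is not necessarily the exact objective used by MASA. The function names `cosine` and `info_nce`, the temperature value, and the toy two-dimensional embeddings are all hypothetical. The embedding at index `i` in each view is assumed to come from the same region proposal, so it forms the positive pair, and every other proposal in the other view acts as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss; view_a[i] and view_b[i] embed the same proposal."""
    losses = []
    for i, query in enumerate(view_a):
        logits = [cosine(query, key) / temperature for key in view_b]
        # Numerically stable log-sum-exp over all candidates in the other view.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability assigned to the true match.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two proposals seen in two augmented views.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[0.9, 0.1], [0.1, 0.9]])  # matches agree
swapped = info_nce(anchors, [[0.1, 0.9], [0.9, 0.1]])  # matches are crossed
```

The loss is small when each proposal is most similar to its own counterpart (`aligned`) and large when the embeddings favor the wrong proposal (`swapped`), which is the pressure that makes the learned embeddings discriminative.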

Paper Structure (35 sections, 3 equations, 16 figures, 16 tables, 1 algorithm)

Figures (16)

  • Figure 1: Given an unlabeled image from any domain, we apply strong augmentations, $\varphi(\cdot)$ and $\phi(\cdot)$, to the image, generating two different views with automatically established pixel correspondences. Then, we leverage the rich object-level information encoded by the foundation segmentation model SAM to lift these pixel-level correspondences to dense instance-level correspondences. Such correspondences enable us to use a diverse collection of unlabeled images to train a universal tracking adapter atop any segmentation or detection foundation model, e.g., SAM. This adapter empowers the foundation models to track any objects they have detected and shows strong zero-shot tracking ability in complex domains.
  • Figure 2: MASA training pipeline. Given an unlabeled image from any domain, SAM automatically generates exhaustive instance masks for it. Then we apply strong augmentations, $\phi(\cdot)$ and $\varphi(\cdot)$, to the original image and its exhaustive instance segmentation, obtaining two different views as the inputs of our model. We train our MASA adapter by jointly distilling SAM's detection knowledge and learning instance similarity. Best viewed in color and zoomed in.
  • Figure 3: The inference pipeline of our unified methods.
  • Figure 4: Comparison on the UVO dataset. (a) We evaluate class-agnostic object detection and video object tracking results with our MASA. Both object localization and association achieve promising performance compared with previous in-domain training methods. (b) We compare the inference time (s) with the original SAM by sampling different numbers of prompt points. Our detection head learns to localize all the potential objects effectively.
  • Figure 5: Qualitative results of our unified models using Ours-Grounding-DINO (top) and Ours-SAM-H (bottom). We use SAM-H to generate masks given the detected boxes.
  • ...and 11 more figures
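
The two-view construction in the Figure 1 and Figure 2 captions can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' pipeline: region proposals are reduced to axis-aligned boxes `(x1, y1, x2, y2)` keyed by a proposal ID, the two augmentations are a horizontal flip and a scale-plus-shift, and the helper names (`hflip`, `scale_shift`, `clip`, `make_view`) are hypothetical. Because both views derive from the same image, a proposal that survives in both views yields a positive pair without any manual labels.

```python
def hflip(box, width):
    """Mirror a box horizontally inside an image of the given width."""
    x1, y1, x2, y2 = box
    return (width - x2, y1, width - x1, y2)


def scale_shift(box, scale, dx, dy):
    """Scale box coordinates about the origin, then translate them."""
    x1, y1, x2, y2 = box
    return (x1 * scale + dx, y1 * scale + dy, x2 * scale + dx, y2 * scale + dy)


def clip(box, width, height, min_size=1.0):
    """Clip a box to the image; return None if too little of it remains."""
    x1, y1, x2, y2 = box
    x1, y1 = max(0.0, x1), max(0.0, y1)
    x2, y2 = min(float(width), x2), min(float(height), y2)
    if x2 - x1 < min_size or y2 - y1 < min_size:
        return None
    return (x1, y1, x2, y2)


def make_view(proposals, transform, width, height):
    """Apply one augmentation to every proposal, keeping its ID."""
    view = {}
    for proposal_id, box in proposals.items():
        clipped = clip(transform(box), width, height)
        if clipped is not None:
            view[proposal_id] = clipped
    return view


# Hypothetical proposals for a 100x100 image, keyed by proposal ID.
W, H = 100, 100
proposals = {0: (10, 20, 50, 80), 1: (60, 10, 90, 40), 2: (0, 0, 8, 8)}

view_a = make_view(proposals, lambda b: hflip(b, W), W, H)
view_b = make_view(proposals, lambda b: scale_shift(b, 1.5, -20.0, -20.0), W, H)

# Proposals visible in both views are the instance-level correspondences.
positive_pairs = sorted(set(view_a) & set(view_b))
```

In this toy case proposal 2 is pushed out of the frame in the second view, so only proposals 0 and 1 form positive pairs. A real pipeline would transform masks and pixels as well as boxes.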