Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching

Yang Liu, Muzhi Zhu, Hengtao Li, Hao Chen, Xinlong Wang, Chunhua Shen

TL;DR

Matcher is a training-free framework that uses vision foundation models (VFMs) to segment anything from a single in-context example, combining DINOv2 for dense patch matching with SAM for mask prediction. Its three components (Correspondence Matrix Extraction, Prompts Generation, and Controllable Masks Generation) generalize across one-shot semantic segmentation, one-shot object part segmentation, and video object segmentation, with one-shot results of 52.7% mIoU on COCO-20$^i$ and 33.0% mIoU on LVIS-92$^i$. Ablation studies support the need for bidirectional matching and for instance-level matching based on optimal transport (OT) and the earth mover's distance (EMD), and qualitative results show open-world segmentation with controllable mask outputs. The results indicate that off-the-shelf VFMs, combined with careful prompting and matching strategies, can rival task-specific or training-based approaches, which may accelerate open-world perception research and offer a new way to evaluate vision foundation models.

Abstract

Powered by large-scale pre-training, vision foundation models exhibit significant potential in open-world image understanding. However, unlike large language models that excel at directly tackling various language tasks, vision foundation models require a task-specific model structure followed by fine-tuning on specific tasks. In this work, we present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks. Matcher can segment anything by using an in-context example without training. Additionally, we design three effective components within the Matcher framework to collaborate with these foundation models and unleash their full potential in diverse perception tasks. Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20$^i$ with one example, surpassing the state-of-the-art specialist model by 1.6%. In addition, Matcher achieves 33.0% mIoU on the proposed LVIS-92$^i$ for one-shot semantic segmentation, outperforming the state-of-the-art generalist model by 14.4%. Our visualization results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild. Our code can be found at https://github.com/aim-uofa/Matcher.

Paper Structure

This paper contains 18 sections, 3 equations, 12 figures, and 8 tables.

Figures (12)

  • Figure 1: An overview of Matcher. Our training-free framework addresses various segmentation tasks through three operations: Correspondence Matrix Extraction, Prompts Generation, and Controllable Masks Generation.
  • Figure 2: Illustration of the proposed bidirectional matching. Bidirectional matching consists of three steps: forward matching, reverse matching, and mask filtering. Purple points denote the matched points. Red points denote the outliers.
  • Figure 3: Qualitative results of one-shot segmentation.
  • Figure 4: Qualitative results of video object segmentation on DAVIS 2017.
  • Figure 5: Illustration of the effects of the purity and coverage.
  • ...and 7 more figures