SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation

Lingchen Meng, Shiyi Lan, Hengduo Li, Jose M. Alvarez, Zuxuan Wu, Yu-Gang Jiang

TL;DR

SEGIC presents an end-to-end in-context segmentation framework that exploits emergent dense correspondences within a frozen vision foundation model to transfer segmentation knowledge from a few in-context exemplars. By encoding in-context information as geometric, visual, and meta instructions and decoding with a lightweight query-based mask decoder, SEGIC segments novel targets without backbone fine-tuning. It achieves state-of-the-art results on one-shot benchmarks such as COCO-20^i, FSS-1000, and LVIS-92^i, and demonstrates competitive performance on video object segmentation and open-vocabulary segmentation, all with data-efficient training. Ablations reveal the importance of high-resolution pre-training, the content of in-context instructions, and augmentation strategies, underscoring SEGIC’s potential for universal, low-cost segmentation across tasks.

Abstract

In-context segmentation aims to segment novel images using a few labeled example images, termed "in-context examples", by exploiting content similarities between the examples and the target. The resulting models generalize seamlessly to novel segmentation tasks, significantly reducing labeling and training costs compared with conventional pipelines. However, in-context segmentation is more challenging than classic segmentation because the model must learn segmentation rules conditioned on only a few samples. Unlike previous work with ad-hoc or non-end-to-end designs, we propose SEGIC, an end-to-end segment-in-context framework built upon a single vision foundation model (VFM). In particular, SEGIC leverages the emergent correspondence within the VFM to capture dense relationships between target images and in-context samples. Information from the in-context samples is then extracted into three types of instructions, i.e., geometric, visual, and meta instructions, which serve as explicit conditions for the final mask prediction. SEGIC is a straightforward yet effective approach that yields state-of-the-art performance on one-shot segmentation benchmarks. Notably, SEGIC generalizes easily to diverse tasks, including video object segmentation and open-vocabulary segmentation. Code will be available at https://github.com/MengLcool/SEGIC.

Paper Structure (16 sections, 7 equations, 7 figures, 6 tables)

This paper contains 16 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Qualitative results of SegIC. SegIC segments target images (bottom row) according to a few labeled example images (top row, linked in the figure), a setting termed "in-context segmentation". SegIC unifies various segmentation tasks via different types of in-context samples, including those annotated with one mask per sample (one-shot segmentation), those annotated with a few masks per sample (video object segmentation), and combinations of annotated samples (semantic segmentation).
  • Figure 2: Architecture overview. SegIC is built upon a frozen vision foundation model and consists of four stages: (1) feature extraction; (2) correspondence discovery (Section \ref{sec:dense_corr}); (3) in-context instruction extraction (Section \ref{sec:icl_features}); (4) mask decoding (Section \ref{sec:mask_decoding}).
  • Figure 3: Performance on zero-shot semantic correspondence and in-context segmentation. We use the dense feature similarity for correspondence estimation. The diameter of each bubble represents the number of parameters of each model.
  • Figure 4: Visualization of propagated masks. We propagate labels from the in-context examples to the targets through the dense correspondences to obtain propagated masks. We employ DINO-v2-large [dinov2] for the visualization.
  • Figure 5: Qualitative results on VOS. SegIC performs well in challenging video object segmentation scenarios, including (a) occlusions, (b) interwoven objects, and (c) small objects.
  • ...and 2 more figures
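
The label propagation mentioned in the Figure 3 and Figure 4 captions (dense feature similarity used to carry labels from an in-context example to a target) can be illustrated with a small sketch. This is not the authors' implementation: the function name `propagate_mask`, the softmax weighting, and the `temperature` parameter are assumptions made for illustration, and the toy 2-D features stand in for patch features that a frozen backbone such as DINO-v2 would produce.

```python
import numpy as np


def propagate_mask(ref_feats, ref_mask, tgt_feats, temperature=0.1):
    """Carry a mask from reference patches to target patches (hypothetical helper).

    ref_feats: (N_ref, C) patch features of the in-context example
    ref_mask:  (N_ref,)   label in [0, 1] for each reference patch
    tgt_feats: (N_tgt, C) patch features of the target image
    Returns a (N_tgt,) soft mask in [0, 1].
    """
    # L2-normalise so the dot product below is cosine similarity.
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)

    # Dense correspondence: similarity of every target patch to every reference patch.
    sim = tgt @ ref.T  # shape (N_tgt, N_ref)

    # Softmax over reference patches (assumed weighting; temperature is illustrative).
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Each target patch receives a similarity-weighted average of reference labels.
    return weights @ ref_mask


# Toy example: two reference patches, only the first one is foreground.
ref_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
ref_mask = np.array([1.0, 0.0])
tgt_feats = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]])

soft_mask = propagate_mask(ref_feats, ref_mask, tgt_feats)
binary_mask = soft_mask > 0.5  # target patches 0 and 2 resemble the foreground patch
```

In this toy setting the first and third target patches point in nearly the same direction as the foreground reference patch, so they receive a high propagated score, while the second does not. According to the abstract, SEGIC does not stop at such a propagated mask: in-context information is further encoded as geometric, visual, and meta instructions that condition a mask decoder.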