A Simple Image Segmentation Framework via In-Context Examples

Yang Liu, Chenchen Jing, Hengtao Li, Muzhi Zhu, Hao Chen, Xinlong Wang, Chunhua Shen

TL;DR

SINE is a simple image Segmentation framework that uses in-context examples. It introduces two components: an In-context Interaction module, which complements in-context information and produces correlations between the target image and the in-context example, and a Matching Transformer, which uses fixed matching and the Hungarian algorithm to eliminate differences between tasks.

Abstract

Recent work has explored generalist segmentation models that tackle a variety of image segmentation tasks within a unified in-context learning framework. However, these methods still struggle with task ambiguity in in-context segmentation, because not every in-context example conveys the task information accurately. To address this issue, we present SINE, a simple image Segmentation framework utilizing in-context examples. Our approach uses a Transformer encoder-decoder structure, where the encoder provides high-quality image representations and the decoder yields multiple task-specific output masks to eliminate task ambiguity. Specifically, we introduce an In-context Interaction module, which complements in-context information and produces correlations between the target image and the in-context example, and a Matching Transformer, which uses fixed matching and the Hungarian algorithm to eliminate differences between tasks. In addition, we refine the current evaluation system for in-context image segmentation to enable a more holistic appraisal of these models. Experiments on various segmentation tasks show the effectiveness of the proposed method.
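The matching step mentioned in the abstract can be illustrated with a small sketch. The abstract only states that a Hungarian algorithm is used, so everything concrete here is an assumption: the Dice-based cost, the function names `dice_cost` and `match_masks`, and the toy masks are hypothetical and are not taken from SINE. For brevity, the sketch finds the optimal one-to-one assignment by brute force over permutations. The Hungarian algorithm solves the same assignment problem in polynomial time (for example via `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations


def dice_cost(pred, target):
    """Return 1 - Dice overlap between two binary masks given as flat 0/1 lists."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 0.0  # two empty masks are treated as a perfect match
    return 1.0 - 2.0 * inter / total


def match_masks(preds, targets):
    """Find the minimum-cost one-to-one assignment of predictions to targets.

    Returns (assignment, cost), where assignment[j] is the index of the
    prediction matched to target j. Brute force: only suitable for tiny inputs.
    """
    assert len(preds) >= len(targets), "need at least one prediction per target"
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(dice_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best), best_cost


# Toy example (assumed data): three predicted masks, two ground-truth masks.
preds = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
targets = [[0, 0, 1, 1], [1, 1, 0, 0]]
assignment, cost = match_masks(preds, targets)
print(assignment, cost)
```

In this toy case each target has one prediction that overlaps it exactly, so the optimal assignment pairs them and leaves the empty third prediction unmatched.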

Paper Structure (25 sections, 4 equations, 13 figures, 10 tables)


Figures (13)

  • Figure 1: Illustration of ambiguity in traditional in-context segmentation framework and an overview of our SINE framework.
  • Figure 2: An overview of SINE. SINE has a Transformer encoder-decoder structure comprising a frozen pre-trained image encoder, an In-Context Interaction module, and a lightweight Matching Transformer (M-Former) decoder. The top right corner shows the self-attention mask of M-Former. The bottom right corner shows the outputs for the different tasks, from identical object to instance to semantic.
  • Figure 2: Results (AP) of few-shot object detection and instance segmentation on COCO-NOVEL with $K=\{1, 5\}$.
  • Figure 3: Illustration of the In-Context Interaction module. This module aims to complement in-context information between the reference and the target. The ID and semantic tokens are extracted by Mask Pooling. The In-Context Fusion module outputs the enhanced target feature, the ID queries, and the semantic prototypes.
  • Figure 4: Qualitative results of SINE. (a) Comparison between SegGPT and SINE for addressing ambiguity in in-context segmentation. (b) Few-shot semantic segmentation. (c) Few-shot instance segmentation. (d) Video object segmentation.
  • ...and 8 more figures
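
The Figure 3 caption says that the ID and semantic tokens are extracted by Mask Pooling, without giving the exact operation. A common reading is masked average pooling, which averages the feature vectors of the pixels inside a reference mask to obtain one token per mask. The sketch below illustrates that reading only. The function name `mask_pool`, the `(H, W, C)` feature layout, and the `eps` stabiliser are assumptions, not details confirmed by the source.

```python
import numpy as np


def mask_pool(features, mask, eps=1e-6):
    """Average the feature vectors over the pixels where the mask is on.

    features: array of shape (H, W, C), per-pixel feature vectors.
    mask:     array of shape (H, W), binary (1 inside the region).
    Returns an array of shape (C,), one pooled token for the region.
    """
    mask = mask.astype(features.dtype)
    # Zero out features outside the mask, then sum over the spatial axes.
    weighted_sum = (features * mask[..., None]).sum(axis=(0, 1))
    # Divide by the region area; eps guards against an empty mask.
    return weighted_sum / (mask.sum() + eps)


# Toy example (assumed data): a 2x2 feature map with 3 channels.
features = np.arange(2 * 2 * 3, dtype=np.float64).reshape(2, 2, 3)
mask = np.array([[1, 0], [0, 1]])
token = mask_pool(features, mask)
print(token)
```

With one mask per instance this would give per-instance tokens, and with one mask per class it would give per-class prototypes. How SINE actually distinguishes ID tokens from semantic tokens is not specified in the caption.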