Unified Open-World Segmentation with Multi-Modal Prompts

Yang Liu, Yufei Yin, Chenchen Jing, Muzhi Zhu, Hao Chen, Yuling Xi, Bo Feng, Hao Wang, Shiyu Li, Chunhua Shen

TL;DR

Open-world segmentation has been tackled separately via open-vocabulary and in-context approaches, limiting generalization across diverse tasks. This work presents COSINE, a unified framework that couples a Model Pool of frozen foundation models with a decoder-only SegDecoder, augmented by an Image-Prompt Aligner and a Multi-Modality Decoder to jointly process text and image prompts. The approach enables segmentation across semantic, instance, panoptic, referring, and video-object tasks with significant improvements over many baselines and strong evidence of cross-modal prompt synergy. By freezing foundation models and training a lightweight decoder, COSINE achieves broad generalization with reduced training cost, offering a practical path toward universal open-world perception.

Abstract

In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and the corresponding multi-modal prompts, and uses a SegDecoder to align these representations, model their interaction, and obtain the masks specified by the input prompts at different granularities. In this way, COSINE overcomes the architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE achieves significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergy between visual and textual prompts leads to significantly improved generalization over single-modality approaches.
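The data flow described in the abstract (features from frozen encoders, prompt-image interaction, then prompt-specified masks) can be illustrated with a toy sketch. This is not the paper's SegDecoder: the feature vectors are hand-written stand-ins for encoder outputs, the interaction step is a single weight-free cross-attention with a residual connection, and the names `cross_attend` and `segment` are hypothetical. The real model uses learned adapters and decoders that this summary does not specify.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys):
    """Residual cross-attention without learned weights (an assumption):
    each query is updated with an attention-weighted average of the keys."""
    d = len(keys[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        ctx = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(d)]
        out.append([qi + ci for qi, ci in zip(q, ctx)])
    return out

def segment(pixel_feats, prompt_embs):
    """Label each pixel with the index of the most similar refined prompt."""
    refined = cross_attend(prompt_embs, pixel_feats)
    labels = []
    for p in pixel_feats:
        scores = [dot(p, r) for r in refined]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels

# Hand-written stand-ins for frozen-encoder outputs (hypothetical values).
pixel_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text_prompt = [1.0, 0.0]   # e.g. an embedded class name
image_prompt = [0.0, 1.0]  # e.g. a pooled feature of an example mask
print(segment(pixel_feats, [text_prompt, image_prompt]))  # [0, 0, 1, 1]
```

Because both prompt types live in the same embedding space in this sketch, text and image prompts can be mixed in one call, which mirrors the prompt-collaboration idea only at the level of the interface.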

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: COSINE is a unified open-world segmentation model that consolidates open-vocabulary and in-context segmentation. COSINE can simultaneously support text prompts (green boxes) and image prompts (blue boxes) as inputs to perform various segmentation tasks, including semantic segmentation, instance segmentation, panoptic segmentation, referring segmentation, and video object segmentation. In addition, COSINE can collaboratively use different types of prompts to perform various segmentation tasks.
  • Figure 2: The architecture of COSINE. (a) COSINE consists of a Model Pool (e.g., DINOv2 and CLIP) used to extract image and prompt features and a SegDecoder used for unified open-world segmentation tasks. (b) The SegDecoder consists of a set of adapters, an Image-Prompt Aligner, a Pixel Decoder, and a Multi-Modality Decoder for modality alignment between the image and prompts, effectively enhancing open-world perception modeling. (c) and (d) show the details of the Image-Prompt Aligner and the Multi-Modality Decoder.
  • Figure 3: Qualitative results. COSINE can perform various open-world segmentation tasks with prompts of different modalities (image and text). For few-shot segmentation, the left image is the example image and the right image is the result.
  • Figure 4: Visualization of prompt synergy. The top row shows the input prompts; the bottom row presents the corresponding outputs.
  • Figure S5: Visualizations of in-context segmentation tasks. (a) Example-based semantic segmentation on the LVIS dataset. The left image with the blue mask is the image example, and the right image with the green mask is the result. (b) Example-based instance segmentation on the LVIS dataset. The outputs are instances that share the same class as the given image prompt. (c) Video object segmentation on the YouTube-VOS 2019 dataset.
  • ...and 1 more figure