Unified Open-World Segmentation with Multi-Modal Prompts
Yang Liu, Yufei Yin, Chenchen Jing, Muzhi Zhu, Hao Chen, Yuling Xi, Bo Feng, Hao Wang, Shiyu Li, Chunhua Shen
TL;DR
Open-world segmentation has so far been tackled by separate open-vocabulary and in-context approaches, which limits generalization across diverse tasks. This work presents COSINE, a unified framework that couples a Model Pool of frozen foundation models with a decoder-only SegDecoder, augmented by an Image-Prompt Aligner and a Multi-Modality Decoder that jointly process text and image prompts. The approach supports semantic, instance, panoptic, referring, and video-object segmentation, improves significantly over many baselines, and provides strong evidence that text and image prompts are more effective together than either modality alone. Because the foundation models stay frozen and only a lightweight decoder is trained, COSINE achieves broad generalization at reduced training cost, offering a practical path toward universal open-world perception.
Abstract
In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE uses foundation models to extract representations for an input image and the corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and produce the masks specified by the input prompts at different granularities. In this way, COSINE overcomes the architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE achieves significant performance improvements on both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that combining visual and textual prompts leads to significantly better generalization than single-modality approaches.
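As a rough illustration of how a decoder can turn frozen-encoder features and multi-modal prompts into a mask, consider the toy sketch below. It is not COSINE's implementation: the names `fuse_prompts` and `predict_mask`, the averaging fusion, and the dot-product scoring are all assumptions chosen for brevity, and the hand-written feature vectors merely stand in for the outputs of frozen foundation models.

```python
import math

def fuse_prompts(text_emb, image_emb):
    """Average a text-prompt and an image-prompt embedding into one query.

    Hypothetical fusion rule, used here only for illustration.
    """
    return [(t + v) / 2.0 for t, v in zip(text_emb, image_emb)]

def predict_mask(pixel_feats, query, threshold=0.5):
    """Score every per-pixel feature against the query and binarize.

    The score is sigmoid(dot(feature, query)); a pixel is kept when the
    score is strictly above `threshold`.
    """
    mask = []
    for row in pixel_feats:
        mask_row = []
        for feat in row:
            logit = sum(f * q for f, q in zip(feat, query))
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask

# Toy 2x2 "image": each pixel has a 2-D feature standing in for the
# output of a frozen encoder (assumed values, not real features).
pixel_feats = [
    [[2.0, 0.0], [-2.0, 0.0]],
    [[0.0, 2.0], [0.0, -2.0]],
]
text_emb = [1.0, 0.0]   # assumed text-prompt embedding
image_emb = [0.0, 1.0]  # assumed image-prompt embedding

query = fuse_prompts(text_emb, image_emb)
mask = predict_mask(pixel_feats, query)            # both prompts
text_only_mask = predict_mask(pixel_feats, text_emb)  # text prompt alone
```

In this toy setting the fused query selects a pixel that the text-only query leaves out, which shows only that the two prompt types can carry complementary cues; it says nothing about the size of the effect reported in the experiments.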
