OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts
Shiting Xiao, Rishabh Kabra, Yuhang Li, Donghyun Lee, Joao Carreira, Priyadarshini Panda
TL;DR
OpenWorldSAM tackles open-vocabulary image segmentation by extending SAM2 with a lightweight language adapter. The adapter fuses a frozen vision-language encoder (BEiT-3), a set of learnable positional tie-breakers, and a cross-attention-based soft prompt, so that a single text prompt can be disambiguated into multiple instances and each instance segmented. The method preserves SAM2’s backbone and segmentation capability while adding only about 4.5 million trainable parameters, and it achieves strong zero-shot and referring-expression performance across ADE20K, PASCAL Context, ScanNet, and RefCOCOg. Key contributions include (i) positional tie-breakers for multi-instance separation, (ii) soft prompting via cross-attention to ground language in image features, and (iii) a unified, prompt-driven interface for semantic, instance, panoptic, and referring segmentation, with oracle-prompt evaluation proposed for fair comparison. The results demonstrate state-of-the-art zero-shot open-vocabulary segmentation with far fewer trainable parameters while preserving SAM2’s interactive capabilities, pointing to practical deployment in real-world, open-world perception systems.
Abstract
The ability to segment objects based on open-ended language prompts remains a critical challenge, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories. We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). Our approach is guided by four key principles: i) Unified prompting: OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions, providing a flexible interface for various segmentation tasks. ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we train only 4.5 million parameters on the COCO-Stuff dataset, achieving remarkable resource efficiency. iii) Instance awareness: We enhance the model's spatial understanding through novel positional tie-breaker embeddings and cross-attention layers, enabling effective segmentation of multiple instances. iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities, generalizing well to unseen categories and an open vocabulary of concepts without additional training. Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks. Code is available at https://github.com/GinnyXiao/OpenWorldSAM.
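To make the instance-awareness idea concrete, the following is a minimal NumPy sketch of how positional tie-breakers and cross-attention could turn one text embedding into several distinct prompt vectors. It is an illustration under stated assumptions, not the released implementation: the function names (`make_instance_queries`, `cross_attend`), the additive way tie-breakers are combined with the text embedding, the single attention head, the residual connection, and all dimensions are hypothetical, and random arrays stand in for the frozen VLM and SAM2 features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_instance_queries(text_embedding, tie_breakers):
    # Assumption: each tie-breaker is simply added to the shared text
    # embedding, so k copies of the same prompt become k distinct queries.
    # text_embedding: (d,), tie_breakers: (k, d) -> (k, d)
    return text_embedding[None, :] + tie_breakers

def cross_attend(queries, image_features, w_q, w_k, w_v):
    # Single-head cross-attention: queries (k, d) attend over image
    # features (n, d). Returns refined prompts (k, d) and weights (k, n).
    q = queries @ w_q
    k = image_features @ w_k
    v = image_features @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return queries + attn @ v, attn  # residual connection (assumed)

rng = np.random.default_rng(0)
d, k, n = 16, 3, 64  # illustrative sizes only

text_embedding = rng.normal(size=d)                # stand-in for a VLM text embedding
tie_breakers = rng.normal(scale=0.1, size=(k, d))  # learnable in a real model
image_features = rng.normal(size=(n, d))           # stand-in for flattened image features
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

queries = make_instance_queries(text_embedding, tie_breakers)
prompts, attn = cross_attend(queries, image_features, w_q, w_k, w_v)
print(prompts.shape, attn.shape)  # (3, 16) (3, 64)
```

Without the tie-breakers, all k queries would be identical and would attend to the same image regions, so every prompt would collapse onto a single instance. In this sketch the per-query offsets are what allow different queries to settle on different instances of the same category.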
