ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model
Kunyang Han, Yibo Hu, Mengxue Qu, Hailin Shi, Yao Zhao, Yunchao Wei
TL;DR
ROSE tackles open-set dense segmentation by eliminating predefined category prompts and predicting dense masks through patch-wise perception. It integrates a vision–language model framework with an instruction-response paradigm for open-category generation and a conversation-based refinement loop to iteratively improve masks and categories. The method extracts objectness, mask embeddings, and category embeddings at the patch level, decodes masks with SAM, and generates categories via LLM-driven prompts, achieving competitive results across semantic, instance, and referring tasks. This approach demonstrates a scalable path toward truly open-set dense segmentation across diverse domains, with CSR providing substantial performance boosts.
Abstract
Advances in CLIP and large multimodal models (LMMs) have enabled open-vocabulary and free-text segmentation, yet existing models still require predefined category prompts, which prevents them from generating free-form categories on their own. Most segmentation LMMs also remain confined to sparse predictions, restricting their applicability in open-set environments. In contrast, we propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation through patch-wise perception. Our method treats each image patch as an independent region-of-interest candidate, enabling the model to predict both dense and sparse masks simultaneously. Additionally, a newly designed instruction-response paradigm takes full advantage of the generation and generalization capabilities of LMMs, achieving category prediction independent of closed-set constraints or predefined categories. To further enhance mask detail and category precision, we introduce a conversation-based refinement paradigm, which integrates the prediction result from the previous step with a textual prompt for revision. Extensive experiments demonstrate that ROSE achieves competitive performance across various segmentation tasks in a unified framework. Code will be released.
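
As a rough illustration of the patch-wise idea described above, the sketch below treats every cell of a patch grid as a region-of-interest candidate and keeps those whose objectness score passes a threshold. It is not the ROSE implementation: the `PatchCandidate` type, the `select_patch_candidates` function, the threshold value, and the use of patch centers as candidate locations are all assumptions made for illustration, and the objectness scores are taken as given rather than predicted by a model.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PatchCandidate:
    """One image patch kept as a region-of-interest candidate (hypothetical type)."""
    row: int
    col: int
    center_xy: Tuple[float, float]  # patch center in pixel coordinates (x, y)
    objectness: float


def select_patch_candidates(
    objectness: Sequence[Sequence[float]],
    patch_size: int,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[PatchCandidate]:
    """Keep every patch whose objectness score is at least `threshold`.

    `objectness[r][c]` is the score of the patch in grid row r, column c.
    Returned candidates are sorted from highest to lowest score.
    """
    candidates = []
    for r, row_scores in enumerate(objectness):
        for c, score in enumerate(row_scores):
            if score >= threshold:
                center_x = c * patch_size + patch_size / 2
                center_y = r * patch_size + patch_size / 2
                candidates.append(PatchCandidate(r, c, (center_x, center_y), score))
    candidates.sort(key=lambda cand: cand.objectness, reverse=True)
    return candidates


# Toy 2x2 patch grid over a 32x32 image with 16-pixel patches.
scores = [[0.1, 0.9],
          [0.6, 0.2]]
kept = select_patch_candidates(scores, patch_size=16)
for cand in kept:
    print(cand.row, cand.col, cand.center_xy, cand.objectness)
```

In a full pipeline, each kept candidate would additionally carry a mask embedding and a category embedding, which a mask decoder and the language model would consume; those stages are omitted here.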
