ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model

Kunyang Han, Yibo Hu, Mengxue Qu, Hailin Shi, Yao Zhao, Yunchao Wei

TL;DR

ROSE tackles open-set dense segmentation by eliminating predefined category prompts and predicting dense masks through patch-wise perception. It integrates a vision–language model framework with an instruction-response paradigm for open-category generation and a conversation-based refinement loop to iteratively improve masks and categories. The method extracts objectness, mask embeddings, and category embeddings at the patch level, decodes masks with SAM, and generates categories via LLM-driven prompts, achieving competitive results across semantic, instance, and referring tasks. This approach demonstrates a scalable path toward truly open-set dense segmentation across diverse domains, with CSR providing substantial performance boosts.

Abstract

Advances in CLIP and large multimodal models (LMMs) have enabled open-vocabulary and free-text segmentation, yet existing models still require predefined category prompts, which prevents them from generating free-form categories on their own. Most segmentation LMMs also remain confined to sparse predictions, restricting their applicability in open-set environments. In contrast, we propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation through patch-wise perception. Our method treats each image patch as an independent region-of-interest candidate, enabling the model to predict both dense and sparse masks simultaneously. Additionally, a newly designed instruction-response paradigm takes full advantage of the generation and generalization capabilities of LMMs, achieving category prediction independent of closed-set constraints or predefined categories. To further enhance mask detail and category precision, we introduce a conversation-based refinement paradigm, which integrates the prediction from the previous step with a textual prompt for revision. Extensive experiments demonstrate that ROSE achieves competitive performance across various segmentation tasks in a unified framework. Code will be released.

Paper Structure

This paper contains 25 sections, 7 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison of existing open-set segmentation frameworks. Both (a) and (b) require predefined category inputs: (a) uses similarity matching to select the target category, while (b) generates object masks according to the given category. Consequently, method (a) can perform dense prediction, while (b) is restricted to referring segmentation. Our approach, however, eliminates the need for predefined category inputs and produces dense predictions directly. 'emb': embedding.
  • Figure 2: The architecture of ROSE. (a) In the Patch-wise Perception Process, the vision encoder first encodes the input image into patch features, which are then concatenated with the text instruction and fed into the large language model. The patch analyzer then processes every patch, generating a mask embedding, a category embedding, and an objectness score. (b) In the Patch-wise Mask and Category Decoding Process, patches are first filtered by their objectness scores. The mask embedding is then fed into the SAM decoder as a prompt to produce the mask corresponding to that patch, and the category embedding is used to predict the corresponding category generatively.
  • Figure 3: Qualitative results. We show some predictions of ROSE in cross-domain and in-domain scenarios, with generated categories labeled near each target. Please zoom in to see the details. The first row shows results on images from other domains, including crayon drawings and clip art. The second row shows some predictions on the COCO val set.
  • Figure 4: Visualization of different refinement mechanisms. The first two columns show the ground truth and the mask to be refined. Concat. denotes concatenating the mask with the image, and Pred. stands for prediction. Mask and Mask+Box are the other methods we tried.
  • Figure 5: $3\times3$ super-patch arrangement.
  • ...and 4 more figures
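
The decoding flow described in the Figure 2 caption (filter patches by objectness, then decode a mask and a category for each surviving patch) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PatchPrediction` container, the function names, the 0.5 threshold, and the two decoder callables (stand-ins for the SAM decoder and the generative category head) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class PatchPrediction:
    """Hypothetical container for the three per-patch outputs named in Figure 2."""
    row: int                              # patch row index in the image grid
    col: int                              # patch column index in the image grid
    objectness: float                     # score used to filter candidate patches
    mask_embedding: Sequence[float]       # would be passed to the mask decoder as a prompt
    category_embedding: Sequence[float]   # would drive generative category prediction


def filter_patches(patches: List[PatchPrediction],
                   threshold: float = 0.5) -> List[PatchPrediction]:
    """Keep patches whose objectness reaches the threshold, highest score first.

    The threshold value is an assumption; the source does not state one.
    """
    kept = [p for p in patches if p.objectness >= threshold]
    return sorted(kept, key=lambda p: p.objectness, reverse=True)


def decode_patches(patches: List[PatchPrediction],
                   mask_decoder: Callable[[Sequence[float]], Any],
                   category_decoder: Callable[[Sequence[float]], str],
                   threshold: float = 0.5) -> List[Tuple[Tuple[int, int], str, Any]]:
    """Decode one (patch position, category, mask) triple per surviving patch.

    `mask_decoder` and `category_decoder` are placeholder callables standing in
    for the SAM decoder and the LLM-based category generation, respectively.
    """
    results = []
    for p in filter_patches(patches, threshold):
        mask = mask_decoder(p.mask_embedding)
        category = category_decoder(p.category_embedding)
        results.append(((p.row, p.col), category, mask))
    return results


if __name__ == "__main__":
    # Toy inputs with stub decoders, only to show the control flow.
    toy = [
        PatchPrediction(0, 0, 0.9, [1.0], [0.0]),
        PatchPrediction(0, 1, 0.2, [2.0], [1.0]),
        PatchPrediction(1, 0, 0.6, [3.0], [2.0]),
    ]
    for pos, cat, mask in decode_patches(
        toy,
        mask_decoder=lambda emb: f"mask<{emb[0]}>",
        category_decoder=lambda emb: f"category<{emb[0]}>",
    ):
        print(pos, cat, mask)
```

Passing the decoders in as callables keeps the sketch independent of any particular SAM or LLM interface; a real system would also need to merge or deduplicate masks predicted from neighboring patches, which is not shown here.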