Refer to Any Segmentation Mask Group With Vision-Language Prompts

Shengcao Cao, Zijun Wei, Jason Kuen, Kangning Liu, Lingzhi Zhang, Jiuxiang Gu, HyunJoon Jung, Liang-Yan Gui, Yu-Xiong Wang

TL;DR

This paper introduces omnimodal referring expression segmentation (ORES), a task in which a model outputs a group of segmentation masks in response to a vision-language prompt. It proposes the Refer to Any Segmentation Mask Group (RAS) framework, which pairs a pool of candidate masks proposed by a segmentation model with a mask-centric large multimodal model and uses non-autoregressive decoding to select flexible mask groups. Two datasets support the task: MaskGroups-2M for large-scale instruction tuning and MaskGroups-HQ for high-quality evaluation. RAS achieves state-of-the-art performance on ORES and competitive results on the RES and GRES benchmarks, indicating cross-task generalization and practical utility for fine-grained image editing and interaction.

Abstract

Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks. Project page: https://Ref2Any.github.io.
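
The task interface described in the abstract can be made concrete with a small sketch. It is illustrative only: the names `OresPrompt` and `merge_group` are hypothetical and do not come from the paper, and representing each reference entity and each output mask as a binary array is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class OresPrompt:
    """Hypothetical ORES prompt: text, optionally with reference entities."""
    text: str
    # Assumption: each reference visual entity is a binary mask of shape (H, W).
    reference_masks: List[np.ndarray] = field(default_factory=list)

    @property
    def is_text_only(self) -> bool:
        return len(self.reference_masks) == 0


def merge_group(masks: List[np.ndarray], shape: tuple) -> np.ndarray:
    """Union a group of binary masks into one boolean mask.

    An empty group yields an all-False mask.
    """
    merged = np.zeros(shape, dtype=bool)
    for mask in masks:
        merged |= mask.astype(bool)
    return merged


# Toy 2x3 image with two candidate entity masks.
left = np.array([[1, 0, 0], [1, 0, 0]])
right = np.array([[0, 0, 1], [0, 0, 1]])

# The two prompt forms named in the abstract: text only, or text plus references.
text_prompt = OresPrompt(text="all objects on the left")
ref_prompt = OresPrompt(text="objects similar to the reference", reference_masks=[left])

# The output is a group of masks; merging them is one way to hand the
# result to a downstream step such as object removal.
group = merge_group([left, right], shape=(2, 3))
```

Keeping the group as a list of per-entity masks, and merging only when a downstream step needs a single region, preserves the entity-level structure that distinguishes a mask group from a single mask.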

Paper Structure

This paper contains 21 sections, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Omnimodal referring expression segmentation (ORES) according to arbitrary vision-language prompts. (Left) Our approach, Refer to Any Segmentation Mask Group (RAS), can understand a complex text prompt involving multiple conditions. (Middle) Reference visual entities can be included as visual prompts to enhance expressivity, addressing the challenge of describing the same details using language alone. (Right) The grouped segmentation masks conveniently enable various fine-grained downstream applications, such as object removal and editing. In each pair of images, the left one is the input and the right one is the output. Best viewed on an electronic device with zoom-in functionality.
  • Figure 2: Overview of our Refer to Any Segmentation Mask Group (RAS) framework. We extend LLaVA-1.5 (Liu et al., 2024) with a segmentation model, a visual encoder ensemble, mask tokenization, and a binary selection classifier for mask grouping. The decoding procedure of the LLM is non-autoregressive (Carion et al., 2020), as the input tokens are given as candidate mask tokens rather than predicted from previous tokens.
  • Figure 3: Examples of MaskGroups-HQ. Diverse vision-language prompts are included, involving object categories, attributes, positions, comparisons, interactions, etc. Best viewed on an electronic device with zoom-in functionality.
  • Figure 4: Fine-grained image content manipulation enabled by our approach. In each row we visualize the original image, the predicted segmentation masks, and the object removal (first two rows) or editing (last two rows) results. Best viewed on an electronic device with zoom-in functionality.
  • Figure A: Prompt type distribution in MaskGroups-HQ. A grouping criterion may involve the categories, the attributes, the absolute or relative positions, the cross-entity comparisons, and even their combination.
  • ...and 3 more figures
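
The selection step summarized in the Figure 2 caption (candidate mask tokens scored by a binary classifier, without autoregressive generation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_mask_group`, the single linear head, and the 0.5 threshold are all hypothetical, and the hidden states are toy values standing in for LLM outputs.

```python
import numpy as np


def select_mask_group(token_states, weight, bias=0.0, threshold=0.5):
    """Score every candidate mask token independently and keep the confident ones.

    token_states: (num_candidates, hidden_dim) array, assumed to be the LLM
                  output states at the candidate mask token positions.
    weight, bias: parameters of a hypothetical linear binary classifier head.
    Returns the indices of selected candidates and all selection probabilities.
    """
    logits = token_states @ weight + bias
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid: one yes/no decision per mask
    selected = np.flatnonzero(probs > threshold)
    return selected, probs


# Toy hidden states for four candidate masks (hidden_dim = 2).
states = np.array([
    [2.0, 0.0],
    [-2.0, 0.0],
    [0.5, 0.5],
    [-1.0, -1.0],
])
weight = np.array([1.0, 1.0])

selected, probs = select_mask_group(states, weight)
print(selected.tolist())  # [0, 2]
```

Because every candidate is scored in one pass and no token depends on a previously generated one, the size of the output group is not fixed in advance: it can be empty, a single mask, or many masks.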