Omni-Referring Image Segmentation

Qiancheng Zheng, Yunhang Shen, Gen Luo, Baiyang Song, Xing Sun, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji

TL;DR

OmniRIS introduces a generalized image segmentation paradigm that unifies text and visual prompts, enabling flexible one-vs-one, one-vs-many, many-vs-many, and no-target settings. The authors present OmniRef, a large-scale dataset with omni-prompts and three test splits, and OmniSegNet, a baseline model with an omni-prompt encoder and a three-stage training regime to learn cross-modal grounding. Quantitative and qualitative experiments show OmniSegNet performs well across text-only, visual-only, and omni-modal prompts and generalizes to existing RIS benchmarks and one-shot scenarios. This work demonstrates the value of combining granular attribute referring with cross-image grounding for highly interactive and generalized segmentation tasks.

Abstract

In this paper, we propose a novel task termed Omni-Referring Image Segmentation (OmniRIS) towards highly generalized image segmentation. Compared with existing unimodally conditioned segmentation tasks, such as RIS and visual RIS, OmniRIS accepts both text instructions and reference images with masks, boxes or scribbles as omni-prompts. This property lets it exploit the intrinsic merits of both modalities: granular attribute referring for text and uncommon object grounding for visual prompts. In addition, OmniRIS can handle various segmentation settings, such as one vs. many and many vs. many, further facilitating its practical use. To promote research on OmniRIS, we also rigorously design and construct a large dataset termed OmniRef, which consists of 186,939 omni-prompts for 30,956 images, and establish a comprehensive evaluation system. Moreover, a strong and general baseline termed OmniSegNet is proposed to tackle the key challenges of OmniRIS, such as omni-prompt encoding. Extensive experiments not only validate the capability of OmniSegNet in following omni-modal instructions, but also show the superiority of OmniRIS for highly generalized image segmentation.
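To make the notion of an omni-prompt concrete, the following is a minimal Python sketch of how such an input might be represented. It is an illustration only, not the OmniRef annotation format or the OmniSegNet interface: the names `VisualPromptType`, `VisualPrompt`, `OmniPrompt` and `setting` are hypothetical, and reading "one vs. many" as "number of referred instances vs. number of target instances" is an assumption drawn from the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class VisualPromptType(Enum):
    """The three visual prompt forms named in the abstract."""
    MASK = "mask"
    BOX = "box"
    SCRIBBLE = "scribble"


@dataclass(frozen=True)
class VisualPrompt:
    # Hypothetical fields: a reference image identifier and how the
    # referred instance(s) are marked in it.
    reference_image: str
    prompt_type: VisualPromptType


@dataclass(frozen=True)
class OmniPrompt:
    # Either part may be absent, but not both (assumption of this sketch).
    text: Optional[str] = None
    visual: Optional[VisualPrompt] = None

    def __post_init__(self) -> None:
        if self.text is None and self.visual is None:
            raise ValueError("an omni-prompt needs text, a visual prompt, or both")

    @property
    def modality(self) -> str:
        """Return 'text', 'visual', or 'omni' depending on which parts are set."""
        if self.text is not None and self.visual is not None:
            return "omni"
        return "text" if self.text is not None else "visual"


def setting(num_referred: int, num_targets: int) -> str:
    """Name the segmentation setting from referred and target instance counts."""
    if num_targets == 0:
        return "no-target"
    left = "one" if num_referred == 1 else "many"
    right = "one" if num_targets == 1 else "many"
    return f"{left} vs. {right}"
```

Under these assumptions, `OmniPrompt(text="the red cup", visual=VisualPrompt("ref.jpg", VisualPromptType.BOX))` is an omni-modal prompt, while a prompt with only `text` set corresponds to the text-only case of conventional RIS.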

Paper Structure

This paper contains 23 sections, 9 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of the proposed Omni-Referring Image Segmentation (OmniRIS) task. OmniRIS supports text expressions (1,2) and reference images with mask (3), box (4) or scribble (5) prompts under the settings of one vs. one, one vs. many, many vs. many or no-target.
  • Figure 2: Statistical comparison between our OmniRef dataset and existing popular RIS datasets. *The reference images (26,859) of OmniRef are not counted.
  • Figure 3: Comparison between OmniRIS and existing unimodal referring segmentation tasks. RIS (a) segments the visual referent corresponding to the given text expression; GRES (b) builds on RIS by extending the types of outputs to cases such as one vs. many and no-target. Visual RIS (c) grounds targets based on the referenced instances in another image. Compared with existing RIS tasks, the proposed OmniRIS (d) merges the merits of text and visual information to support more flexible image segmentation.
  • Figure 4: Statistics of the training set and three test splits of OmniRef. All splits cover single-target, multi-target and no-target segmentation outputs, as well as the cases of one vs. one, one vs. many, many vs. one, many vs. many and no-target.
  • Figure 5: The construction pipeline of the proposed OmniRef dataset, which consists of four main steps. Step I filters out the images that have too few objects and lack visual diversity, and selects the ones containing multiple categories and spatially well-distributed objects as the target and reference images. Step II pairs the target and reference images to construct visual prompts for different segmentation cases. Step III aligns text prompts with target images that involve diverse cases, including the single-target, multi-target and no-target outputs with long and complex expressions. Step IV merges visual and text annotations to form the final omni-modal examples. After the four steps, manual checking is also conducted to ensure the quality of examples.
  • ...and 3 more figures