GSVA: Generalized Segmentation via Multimodal Large Language Models

Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, Gao Huang

TL;DR

This work tackles Generalized Referring Expression Segmentation (GRES), where prompts can refer to multiple targets or to targets absent from the image. It proposes the Generalized Segmentation Vision Assistant (GSVA), which connects a Multimodal Large Language Model (MLLM) with a high-resolution Segmentation Foundation Model (SFM). GSVA extends prior MLLM-based segmentation in two ways: it learns multiple weight-sharing [SEG] tokens so that several targets can be segmented from one prompt, and it introduces a [REJ] token that explicitly rejects empty targets. Key contributions include (1) a novel prompt design enabling multiple segmentation queries, (2) a clean empty-target rejection mechanism, (3) empirical state-of-the-art performance on gRefCOCO for GRES, and (4) strong results on classic RES and REC tasks, plus comprehensive ablations and visualizations. The approach improves robustness in real-world scenarios such as embodied AI, where instructions may reference several objects or none at all, by combining in-context learning cues with explicit rejection signaling in a unified output format.

Abstract

Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image. GRES poses challenges in modeling the complex spatial relationships of the instances in the image and identifying non-existing referents. Multimodal Large Language Models (MLLMs) have recently shown tremendous progress in these complicated vision-language tasks. Connecting Large Language Models (LLMs) and vision models, MLLMs are proficient in understanding contexts with visual inputs. Among them, LISA, as a representative, adopts a special [SEG] token to prompt a segmentation mask decoder, e.g., SAM, to enable MLLMs in the RES task. However, existing solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt or provide descriptions incongruent with any image target. In this paper, we propose Generalized Segmentation Vision Assistant (GSVA) to address this gap. Specifically, GSVA reuses the [SEG] token to prompt the segmentation model towards supporting multiple mask references simultaneously and innovatively learns to generate a [REJ] token to reject the null targets explicitly. Experiments validate GSVA's efficacy in resolving the GRES issue, marking a notable enhancement and setting a new record on the GRES benchmark gRefCOCO dataset. GSVA also proves effective across various classic referring segmentation and comprehension tasks.

Paper Structure

This paper contains 25 sections, 7 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Comparison of the segmentation masks produced by LISA (Lai et al., 2023) and GSVA on the challenges of Generalized Referring Expression Segmentation (GRES) (Liu et al., 2023). (a) LISA fails to segment the correct targets when multiple targets are requested, because it is restricted to a single [SEG] token. GSVA generates all target masks by learning multiple [SEG] tokens. (b) When the referent does not exist in the image, i.e., an empty target is requested, LISA still produces a wrong mask because it is forced to output a [SEG] token. In contrast, GSVA can reject empty targets by predicting [REJ] tokens in the output sequence.
  • Figure 2: Overview of GSVA. At the bottom of the figure, the MLLM encodes the input image and concatenates the tokenized text tokens to follow instructions. GSVA generates multiple [SEG] tokens to handle multiple referred targets and rejects the objects absent in the image through [REJ] tokens. At the top of the figure, the SFM also encodes the image for segmentation and selects all [SEG] tokens in the output sequence to prompt the mask decoder to segment the target objects referred to in the instructions.
  • Figure 3: Example of the prompts and predicted masks of GSVA-Vicuna-7B, drawn from the gRefCOCO validation set. (a) depicts the multiple-target case, in which the two referred zebras are handled with two separate [SEG] tokens. (b) shows the empty-target case, where no apple is in the bowl. The null referent is therefore rejected with a [REJ] token, and no segmentation mask is generated.
  • Figure 4: Visualizations of GSVA and LISA (Lai et al., 2023) in the GRES task. The first row shows LISA's segmentation results, the second row shows the masks and rejections of GSVA, and the third row shows the referring expressions in the instructions. In (a), the multiple-target cases, each target is drawn in its own color. In (b), the empty-target cases, the images are darkened to highlight the incorrect predictions of LISA. The examples are selected from the gRefCOCO validation set. The masks are generated by the 7B models. Zoom in for the best view.
  • Figure 5: Generalized Reasoning Segmentation Example.
  • ...and 4 more figures
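
The token routing described in the captions of Figures 2 and 3 can be illustrated with a small sketch. It is not the authors' implementation: the function name `route_targets`, the `TargetResult` container, the toy token sequence, and the use of plain Python lists in place of hidden-state tensors are all assumptions made for illustration. It only shows the idea that every [SEG] token in the output sequence yields one prompt for the mask decoder, while every [REJ] token marks a target as absent and yields no mask.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

SEG = "[SEG]"  # special token that requests a mask for one target
REJ = "[REJ]"  # special token that rejects an empty (absent) target


@dataclass
class TargetResult:
    index: int                      # position of the target among all [SEG]/[REJ] tokens
    rejected: bool                  # True when the model emitted [REJ]
    prompt: Optional[List[float]]   # embedding passed to the mask decoder, None if rejected


def route_targets(
    tokens: Sequence[str],
    hidden: Sequence[Sequence[float]],
) -> List[TargetResult]:
    """Collect one result per [SEG] or [REJ] token in the output sequence.

    `hidden` stands in for the per-token output embeddings of the MLLM
    (hypothetical here: plain lists instead of tensors).
    """
    if len(tokens) != len(hidden):
        raise ValueError("tokens and hidden must have the same length")

    results: List[TargetResult] = []
    for tok, h in zip(tokens, hidden):
        if tok == SEG:
            # A [SEG] embedding would prompt the mask decoder for this target.
            results.append(TargetResult(len(results), False, list(h)))
        elif tok == REJ:
            # A [REJ] token means the referent is absent: no mask is decoded.
            results.append(TargetResult(len(results), True, None))
    return results


# Toy output sequence (assumed format): one present target, one absent target.
tokens = ["zebra", ":", SEG, ",", "apple", ":", REJ]
hidden = [[float(i)] * 4 for i in range(len(tokens))]
results = route_targets(tokens, hidden)

for r in results:
    status = "rejected" if r.rejected else "segment"
    print(r.index, status)
```

In a real system the `prompt` entries would be handed to the segmentation model's mask decoder in a batch, so the number of predicted masks follows the number of [SEG] tokens rather than being fixed at one.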