Prompt-Guided Mask Proposal for Two-Stage Open-Vocabulary Segmentation

Yu-Jhe Li, Xinyang Zhang, Kun Wan, Lantao Yu, Ajinkya Kale, Xin Lu

TL;DR

This work tackles open-vocabulary segmentation by enabling text prompts to guide the first-stage mask proposals. It introduces Prompt-guided Mask Proposal (PMP), which injects text tokens into the transformer-based mask decoder via a text-query cross-attention mechanism to produce prompt-specific mask embeddings. When integrated with existing two-stage backbones (e.g., Mask2Former-based systems), PMP yields consistent mIoU gains (about 1–3 percentage points) across five benchmark datasets and supports prompts of varying complexity, including abstract or proprietary terms. The approach is lightweight, modular, and demonstrates robust generalization to novel prompts while maintaining practical inference efficiency, making it suitable for real-world open-vocabulary segmentation tasks.

Abstract

We tackle the challenge of open-vocabulary segmentation, where we need to identify objects from a wide range of categories in different environments, using text prompts as our input. To overcome this challenge, existing methods often use multi-modal models like CLIP, which combine image and text features in a shared embedding space to bridge the gap between limited and extensive vocabulary recognition. This results in a two-stage approach: in the first stage, a mask generator takes an input image and generates mask proposals, and in the second stage the target mask is picked based on the query. However, the expected target mask may not exist among the generated mask proposals, which leads to an unexpected output mask. In our work, we propose a novel approach named Prompt-guided Mask Proposal (PMP), where the mask generator takes the input text prompts and generates masks guided by these prompts. Compared with mask proposals generated without input prompts, masks generated by PMP are better aligned with the input prompts. To realize PMP, we designed a cross-attention mechanism between text tokens and query tokens that is capable of generating prompt-guided mask proposals after each decoding layer. We combined PMP with several existing works that employ a query-based segmentation backbone, and experiments on five benchmark datasets demonstrate the effectiveness of this approach, showing significant improvements over current two-stage models (1% to 3% absolute gain in mIoU). The steady improvement in performance across these benchmarks indicates the effective generalization of our proposed lightweight prompt-aware method.
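
The cross-attention between text tokens and query tokens described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention head, no learned projection matrices, scaled dot-product scores, and a residual update, all of which are simplifications chosen here for readability. The function name `text_query_cross_attention` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_query_cross_attention(queries, text_tokens):
    """Let each query token attend to the text tokens.

    queries:     N vectors of length d (the decoder's query tokens)
    text_tokens: M vectors of length d (from a text encoder)
    Returns N updated query vectors of length d.

    Assumption: keys and values are the raw text tokens; a real
    decoder layer would apply learned projections and normalization.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    updated = []
    for q in queries:
        # Similarity of this query to every text token.
        scores = [scale * sum(a * b for a, b in zip(q, t)) for t in text_tokens]
        weights = softmax(scores)
        # Weighted sum of text tokens: the prompt context for this query.
        context = [
            sum(w * t[k] for w, t in zip(weights, text_tokens)) for k in range(d)
        ]
        # Residual update (an assumption of this sketch).
        updated.append([a + c for a, c in zip(q, context)])
    return updated


# Toy example: N = 3 queries, M = 2 text tokens, d = 2.
queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]
out = text_query_cross_attention(queries, text_tokens)
```

In this toy setting the first query is pulled toward the first text token and the second query toward the second, which is the intuition behind making the queries, and hence the resulting mask proposals, prompt-aware.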

Paper Structure

This paper contains 39 sections, 5 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The significance of prompt-guided mask proposals for open-vocabulary segmentation. Compared with previous work (liang2023open, shown as an example in the middle image), our mask proposals generated with input prompt guidance contain a reasonable segmentation mask, which allows the CLIP model to retrieve a proprietary prompt such as "Yellowstone". Faces are masked out for privacy reasons.
  • Figure 2: Overview of the proposed prompt-guided mask proposal (PMP) in the two-stage pipeline for open-vocabulary segmentation. The entire pipeline contains an image encoder $E_I$, a pixel decoder $E_P$, a text encoder $E_T$, and a transformer decoder $E_D$. We utilize the query-based transformer decoder $E_D$ to produce the $N$ mask embeddings $\{ {z}_i \}_{i=1}^N$ given $N$ queries $\{ {q}_i \}_{i=1}^N$. Specifically, our PMP, built on top of the transformer decoder, takes the $N$ query tokens and the essential text tokens $\{ {t}_j \}_{j=1}^M$, where $M$ varies depending on the number of given prompts, to produce the $N$ mask embeddings from a given image $I$. It consists of a stack of layers, each built from a text-query cross-attention block followed by a standard decoding block. The image encoder $E_I$ is introduced to obtain a visual-spatial feature $f_I$ of the entire image, which the transformer decoder uses to obtain the $N$ mask embeddings. The transformer decoder can also take multi-level pixel embeddings $f_P$ generated by the pixel decoder $E_P$ for improved generalization. The generated mask embeddings are transformed into mask proposals $\{ \tilde{m}_i \}_{i=1}^{N}$ by multiplying them with the pixel embeddings $f_P$. The class labels $\{ \tilde{c}_i \}_{i=1}^N$ are also produced from these mask embeddings with the pre-trained language model (e.g., CLIP).
  • Figure 3: Qualitative results of open-vocabulary segmentation on seven real example images that we captured. The input prompts contain more than just object classes, such as abstract or proprietary words. We compare with the previous approach OVSeg (liang2023open). More results are presented in the supplementary material. Faces are masked out for privacy reasons.
  • Figure 4: Comparison of our model with SAM kirillov2023segment (+CLIP).
  • Figure 5: Illustration of four different strategies of decoding queries with input text tokens. These strategies include (a) Concatenate, (b) Concatenate and drop, (c) Text tokens as queries, and (d) our proposed cross-attention.
  • ...and 3 more figures
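
The Figure 2 caption states that mask embeddings $z_i$ become mask proposals $\tilde{m}_i$ through multiplication with the pixel embeddings $f_P$. The sketch below shows one way this step could look. The per-pixel dot product follows the caption, while the sigmoid, the 0.5 threshold, the tensor layout, and the name `mask_proposals` are assumptions of this example (borrowed from common query-based segmentation practice), not details confirmed by the source.

```python
import math


def mask_proposals(mask_embeddings, pixel_embeddings, threshold=0.5):
    """Turn N mask embeddings into N binary masks.

    mask_embeddings:  N vectors of length d
    pixel_embeddings: nested list of shape d x H x W
    Returns N masks, each an H x W nested list of 0/1 values.

    Assumption: logits pass through a sigmoid and are thresholded.
    """
    d = len(pixel_embeddings)
    height = len(pixel_embeddings[0])
    width = len(pixel_embeddings[0][0])
    masks = []
    for z in mask_embeddings:
        mask = []
        for y in range(height):
            row = []
            for x in range(width):
                # Dot product of the mask embedding with this pixel's embedding.
                logit = sum(z[k] * pixel_embeddings[k][y][x] for k in range(d))
                prob = 1.0 / (1.0 + math.exp(-logit))
                row.append(1 if prob > threshold else 0)
            mask.append(row)
        masks.append(mask)
    return masks


# Toy pixel embeddings with d = 2 channels on a 2 x 2 image:
# channel 0 is positive on the top row, channel 1 on the left column.
pixel_embeddings = [
    [[1.0, 1.0], [-1.0, -1.0]],
    [[1.0, -1.0], [1.0, -1.0]],
]
mask_embeddings = [[4.0, 0.0], [0.0, 4.0]]  # N = 2 mask embeddings
masks = mask_proposals(mask_embeddings, pixel_embeddings)
```

Here the first embedding selects the top row and the second selects the left column, so each mask embedding acts as a per-pixel classifier over the shared pixel features.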