PVPUFormer: Probabilistic Visual Prompt Unified Transformer for Interactive Image Segmentation

Xu Zhang, Kailun Yang, Jiacheng Lin, Jin Yuan, Zhiyong Li, Shutao Li

TL;DR

PVPUFormer is a simple yet effective Probabilistic Visual Prompt Unified Transformer for interactive image segmentation. It lets users flexibly input diverse visual prompts, and it uses probabilistic prompt encoding and feature post-processing to extract sufficient, robust prompt features that improve segmentation performance.

Abstract

Integrating diverse visual prompts such as clicks, scribbles, and boxes into interactive image segmentation makes interaction easier for users and improves interaction efficiency. However, existing studies primarily encode the position or pixel regions of prompts without considering the contextual areas around them, resulting in insufficient prompt feedback, which hinders rapid performance improvement. To tackle this problem, this paper proposes a simple yet effective Probabilistic Visual Prompt Unified Transformer (PVPUFormer) for interactive image segmentation, which lets users flexibly input diverse visual prompts and uses probabilistic prompt encoding and feature post-processing to extract sufficient and robust prompt features. Specifically, we first propose a Probabilistic Prompt-unified Encoder (PPuE) that generates a unified one-dimensional vector by exploring both prompt and non-prompt contextual information, offering richer feedback cues to accelerate performance improvement. On this basis, we further present a Prompt-to-Pixel Contrastive (P$^2$C) loss that aligns prompt and pixel features, bridging the representation gap between them to provide consistent feature representations for mask prediction. Moreover, we design a Dual-cross Merging Attention (DMA) module that implements bidirectional feature interaction between image and prompt features, generating notable features for performance improvement. Comprehensive experiments on several challenging datasets demonstrate that the proposed components achieve consistent improvements, yielding state-of-the-art interactive segmentation performance. Our code is available at https://github.com/XuZhang1211/PVPUFormer.
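To make the idea of aligning prompt and pixel features more concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss. It is not the paper's P$^2$C formulation, which the abstract does not spell out. The function name `prompt_to_pixel_contrastive`, the `temperature` parameter, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def prompt_to_pixel_contrastive(prompt_feat, pixel_feats, is_target, temperature=0.1):
    """Hypothetical InfoNCE-style loss (an assumption, not the paper's exact loss).

    Pixels flagged in `is_target` act as positives for the prompt feature;
    all remaining pixels act as negatives.
    """
    if not any(is_target):
        raise ValueError("at least one target pixel is required")
    p = normalize(prompt_feat)
    # Cosine similarity between the prompt and each pixel, scaled by temperature.
    logits = [dot(p, normalize(f)) / temperature for f in pixel_feats]
    # Subtract the maximum logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    positives = sum(e for e, t in zip(exps, is_target) if t)
    return -math.log(positives / sum(exps))
```

Under these assumptions, the loss is small when the prompt feature points in the same direction as the features of target pixels and large when it resembles background pixels instead, which is the sense in which such a loss narrows the gap between the two representations.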

Paper Structure

This paper contains 20 sections, 15 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of different prompt encoding strategies. The two-dimensional prompt encoding (subfigure (a)) introduces irrelevant information, while the one-dimensional prompt encodings (subfigures (b) and (c)) ignore contextual regions that are usually of interest to users. Our prompt encoding (subfigure (d)) uses probabilistic estimation to encode both prompt and non-prompt information and can convert clicks, boxes, and scribbles into a unified probability representation (see subfigure (e); the darker the color, the higher the probability), offering richer feedback cues for performance boosting.
  • Figure 2: The pipeline of the proposed Probabilistic Visual Prompt Unified Transformer (PVPUFormer), which consists of four components: a Probabilistic Prompt-unified Encoder (PPuE), an Image Encoder, a Dual-cross Merging Attention (DMA) module, and a Multi-scale Feature Decoder.
  • Figure 3: Three examples showing click, box, and scribble encoding by the PPuE. The PPuE constructs a one-dimensional prompt vector $q$ to represent a visual prompt, composed of three parts: a horizontal representation vector $q_h$, a vertical representation vector $q_v$, and an intention property vector $q_b$.
  • Figure 4: Comparisons of the mIoU-NoC curves on four datasets by different approaches.
  • Figure 5: Qualitative comparisons of segmentation results by different approaches (RITM, CDNet, FocalClick, and our method) on three difficult examples: the first example (first two columns) has a spidery golf club to be masked, the second example (middle two columns) has similar foreground and background colors, and the third example has partially occluded object components.
  • ...and 2 more figures
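
The three-part prompt vector described in the Figure 3 caption can be illustrated with a small sketch. It rests on several assumptions that do not come from the paper: a prompt is reduced to its horizontal and vertical extent, the probability is 1.0 inside that extent and decays as a Gaussian outside it, and the intention property is a two-element one-hot vector for positive versus negative prompts. The names `axis_profile`, `encode_prompt`, and `sigma` are hypothetical.

```python
import math


def axis_profile(length, lo, hi, sigma):
    """Probability-like profile along one axis (assumed form).

    Positions inside [lo, hi] get 1.0; positions outside decay as a
    Gaussian of their distance to the nearest end of the interval.
    """
    profile = []
    for i in range(length):
        d = max(lo - i, 0, i - hi)  # distance from position i to the interval
        profile.append(math.exp(-(d * d) / (2.0 * sigma * sigma)))
    return profile


def encode_prompt(points, width, height, positive=True, sigma=2.0):
    """Build q = q_h + q_v + q_b from a list of (x, y) prompt points.

    A click is one point, a box is its two corners, and a scribble is a
    list of points; all are treated uniformly by their extent (an assumption).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    q_h = axis_profile(width, min(xs), max(xs), sigma)   # horizontal part
    q_v = axis_profile(height, min(ys), max(ys), sigma)  # vertical part
    q_b = [1.0, 0.0] if positive else [0.0, 1.0]         # intention part
    return q_h + q_v + q_b
```

For example, `encode_prompt([(3, 2)], width=8, height=6)` returns a vector of length 8 + 6 + 2 = 16 whose horizontal part peaks at index 3 and whose vertical part peaks at index 2, while nearby positions keep a non-zero value. That non-zero falloff is one simple way to carry the contextual, non-prompt information that the caption of Figure 1 refers to.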