InvSeg: Test-Time Prompt Inversion for Semantic Segmentation

Jiayi Lin, Jiabo Huang, Jian Hu, Shaogang Gong

TL;DR

InvSeg addresses open-vocabulary semantic segmentation by closing the distribution gap between rich, image-specific generation prompts and isolated class names. It inverts image context into the diffusion text embedding space at test time, guided by Contrastive Soft Clustering to produce region-level, structure-consistent masks, and stabilizes learning with entropy minimization. The approach yields state-of-the-art results on VOC and PASCAL Context and competitive performance on COCO Object without requiring per-pixel labels, highlighting strong cross-modal alignment and unsupervised region inversion capabilities. This work demonstrates that image-specific prompts learned at test time can substantially improve diffusion-model-based segmentation and unlock robust open-set performance for diverse visual concepts.
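The entropy-minimization idea mentioned above can be illustrated with a minimal NumPy sketch. It assumes per-class scores of shape (C, H, W) that are turned into a per-pixel class distribution with a softmax; the function names, the softmax normalization, and the averaging over pixels are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mean_pixel_entropy(class_scores, eps=1e-12):
    """Average per-pixel entropy of the class distribution.

    class_scores: array of shape (C, H, W), one score map per class
    (hypothetical input; the paper derives its maps from attention).
    """
    p = softmax(class_scores, axis=0)             # (C, H, W), sums to 1 over C
    entropy = -(p * np.log(p + eps)).sum(axis=0)  # (H, W)
    return float(entropy.mean())
```

Minimizing such a term would push each pixel toward a confident assignment to one class: uniform scores give the maximum value log(C), while sharply peaked scores give a value near zero.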

Abstract

Visual-textual correlations in the attention maps derived from text-to-image diffusion models have proven beneficial to dense visual prediction tasks, e.g., semantic segmentation. However, a significant challenge arises from the input distributional discrepancy between the context-rich sentences used for image generation and the isolated class names typically used in semantic segmentation. This discrepancy hinders diffusion models from capturing accurate visual-textual correlations. To solve this, we propose InvSeg, a test-time prompt inversion method that tackles open-vocabulary semantic segmentation by inverting image-specific visual context into the text prompt embedding space, leveraging structure information derived from the diffusion model's reconstruction process to enrich text prompts so as to associate each class with a structure-consistent mask. Specifically, we introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information, softly selecting anchors for each class and calculating weighted distances to pull intra-class pixels closer while separating inter-class pixels, thereby ensuring mask distinction and internal consistency. By incorporating sample-specific context, InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities. Experiments show that InvSeg achieves state-of-the-art performance on the PASCAL VOC, PASCAL Context and COCO Object datasets.
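The Contrastive Soft Clustering idea described in the abstract (soft anchor selection per class, then anchor-weighted distances that are small inside a class mask and large outside it) might be sketched as follows. This is a rough illustration under stated assumptions, not the authors' loss: the softmax-based anchor weights, the `temperature` parameter, the "intra minus inter" form, and the names `soft_anchor_weights` and `csc_loss` are all hypothetical. The distance matrix is assumed to be a precomputed (N, N) pairwise pixel distance built from the image's structure features.

```python
import numpy as np

def soft_anchor_weights(masks, temperature=0.1):
    """Soft anchor selection: a distribution over pixels for each class.

    masks: (C, N) soft mask values in [0, 1] for C classes and N pixels.
    Pixels with high mask values get high anchor probability (assumption).
    """
    z = masks / temperature
    z = z - z.max(axis=1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

def csc_loss(masks, dist, temperature=0.1, eps=1e-8):
    """Illustrative contrastive clustering loss (lower is better).

    dist: (N, N) pairwise pixel distances from structure features.
    """
    anchors = soft_anchor_weights(masks, temperature)  # (C, N)
    # Expected distance from each class's anchors to every pixel.
    anchor_dist = anchors @ dist                       # (C, N)
    inside = masks / (masks.sum(axis=1, keepdims=True) + eps)
    outside = (1.0 - masks) / ((1.0 - masks).sum(axis=1, keepdims=True) + eps)
    intra = (inside * anchor_dist).sum(axis=1)   # pull in-mask pixels to anchors
    inter = (outside * anchor_dist).sum(axis=1)  # push out-of-mask pixels away
    return float((intra - inter).mean())
```

On a toy example with two well-separated groups of pixels, masks that follow the groups give a lower value than masks that cut across them, which is the behavior a structure-consistency term of this kind is meant to reward.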

Paper Structure

This paper contains 14 sections, 8 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Motivation of test-time prompt inversion.
  • Figure 2: Overview of the InvSeg framework. Our proposed Contrastive Soft Clustering method achieves region-level prompt inversion. The text tokens are first initialized with the pretrained text encoder from the diffusion model (dashed box on the left) and are then used as the only learnable parameters during test-time training. After the adaptation process, the learned text tokens can be used to derive more accurate and complete refined attention maps $\{M\}$ for segmentation.
  • Figure 3: Illustration of the soft selection (with probability) of anchor points for each category $c$: $Anchor^c$ during different optimization steps (left) and the distance matrix $S$ on certain anchor points (right). On the right sub-figure, we sample 3 anchor points for each category, showing the distance from each anchor point to other pixels in the image. Darker areas represent smaller distances (higher similarity) to the anchor.
  • Figure 4: Examples of segmentation on VOC (top), Context (middle) and COCO (bottom). Each sample is a group of four images showing, from left to right, the input, GT, InvSeg, and the diffusion baseline.
  • Figure 5: Visualization of refined cross-attention maps derived from text prompts before (top) and after (bottom) prompt inversion. Before prompt inversion, the segmentation of background elements such as "grass" or "trees" is influenced by foreground objects like "cow" or "horse", so background classes are mistakenly ignored or foreground (and background) classes are mistakenly segmented. After prompt inversion, this effect is suppressed because the proposed Contrastive Soft Clustering improves the distinction between foreground and background.
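
As a small illustration of how per-class attention maps such as the $\{M\}$ in Figures 2 and 5 could be turned into a label map, the sketch below min-max normalizes each class map and takes a per-pixel argmax. Both steps, and the function name `attention_to_labels`, are assumptions made for illustration; the source does not state how the final masks are decoded.

```python
import numpy as np

def attention_to_labels(attn_maps, eps=1e-8):
    """Convert per-class attention maps to a label map (illustrative).

    attn_maps: (C, H, W) array, one refined attention map per class.
    Returns an (H, W) integer array of class indices.
    """
    lo = attn_maps.min(axis=(1, 2), keepdims=True)
    hi = attn_maps.max(axis=(1, 2), keepdims=True)
    # Min-max normalize each class map so classes are comparable (assumption).
    norm = (attn_maps - lo) / (hi - lo + eps)
    return norm.argmax(axis=0)

# Toy usage: two classes on a 2x2 image.
maps = np.array([
    [[0.9, 0.1], [0.8, 0.2]],   # class 0 responds on the left column
    [[0.1, 0.5], [0.0, 0.4]],   # class 1 responds on the right column
])
labels = attention_to_labels(maps)
```

With this decoding, a background class whose attention is weak in absolute terms can still win pixels where it is relatively strongest, which is one reason a per-class normalization is a common choice in sketches like this one.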