InvSeg: Test-Time Prompt Inversion for Semantic Segmentation
Jiayi Lin, Jiabo Huang, Jian Hu, Shaogang Gong
TL;DR
InvSeg addresses open-vocabulary semantic segmentation by closing the distribution gap between the rich, image-specific prompts used for generation and the isolated class names used for segmentation. It inverts image context into the diffusion text embedding space at test time, guided by Contrastive Soft Clustering to produce region-level, structure-consistent masks, and stabilizes learning with entropy minimization. The approach yields state-of-the-art results on PASCAL VOC and PASCAL Context and competitive performance on COCO Object without requiring per-pixel labels, highlighting strong cross-modal alignment and unsupervised region inversion capabilities. This work demonstrates that image-specific prompts learned at test time can substantially improve diffusion-model-based segmentation and unlock robust open-set performance for diverse visual concepts.
Abstract
Visual-textual correlations in the attention maps derived from text-to-image diffusion models have proven beneficial to dense visual prediction tasks, e.g., semantic segmentation. However, a significant challenge arises from the input distributional discrepancy between the context-rich sentences used for image generation and the isolated class names typically used in semantic segmentation. This discrepancy hinders diffusion models from capturing accurate visual-textual correlations. To address this, we propose InvSeg, a test-time prompt inversion method for open-vocabulary semantic segmentation. InvSeg inverts image-specific visual context into the text prompt embedding space, leveraging structure information derived from the diffusion model's reconstruction process to enrich text prompts so that each class is associated with a structure-consistent mask. Specifically, we introduce Contrastive Soft Clustering (CSC) to align the derived masks with the image's structure information: it softly selects anchors for each class and calculates weighted distances to pull intra-class pixels closer while separating inter-class pixels, thereby ensuring that masks are distinct from one another and internally consistent. By incorporating sample-specific context, InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities. Experiments show that InvSeg achieves state-of-the-art performance on the PASCAL VOC, PASCAL Context and COCO Object datasets.
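
To make the Contrastive Soft Clustering idea more concrete, the following is a minimal sketch of one way such an objective could look; it is not the authors' implementation. It assumes each pixel has a structure feature vector and each class has a soft mask with values in [0, 1]. The function names (`soft_anchors`, `csc_loss`), the use of squared Euclidean distance, and the exact form of the loss (mask-weighted intra-class distance minus inverse-mask-weighted inter-class distance) are all illustrative assumptions.

```python
EPS = 1e-8  # guards against division by zero for empty or full masks


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_anchors(features, masks):
    """One anchor per class: the mask-weighted mean of the pixel features.

    features: list of per-pixel feature vectors (hypothetical structure features)
    masks:    list of per-class soft masks, each a list of weights in [0, 1]
    """
    dim = len(features[0])
    anchors = []
    for mask in masks:
        total = sum(mask) + EPS
        anchors.append(
            [sum(w * f[d] for w, f in zip(mask, features)) / total
             for d in range(dim)]
        )
    return anchors


def csc_loss(features, masks):
    """Assumed contrastive objective: low when each mask covers a compact
    feature cluster that is far from the pixels outside the mask."""
    anchors = soft_anchors(features, masks)
    loss = 0.0
    for mask, anchor in zip(masks, anchors):
        dists = [sq_dist(f, anchor) for f in features]
        in_weight = sum(mask) + EPS
        out_weight = sum(1.0 - w for w in mask) + EPS
        # Pull pixels inside the mask toward the class anchor.
        intra = sum(w * d for w, d in zip(mask, dists)) / in_weight
        # Push pixels outside the mask away from the class anchor.
        inter = sum((1.0 - w) * d for w, d in zip(mask, dists)) / out_weight
        loss += intra - inter
    return loss / len(masks)


# Toy example: four pixels forming two well-separated feature clusters.
features = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
aligned = [[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.9, 0.9]]     # masks follow the clusters
misaligned = [[0.9, 0.1, 0.9, 0.1], [0.1, 0.9, 0.1, 0.9]]  # masks cut across them
print(csc_loss(features, aligned) < csc_loss(features, misaligned))
```

In this toy setting, masks that follow the feature clusters receive a lower loss than masks that cut across them, which is the behaviour the abstract describes. In a test-time inversion setting the masks would come from attention maps conditioned on the learnable prompt embedding, and the loss would be minimized with respect to that embedding; that wiring is omitted here.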
