Tuning-free Universally-Supervised Semantic Segmentation

Xiaobo Yang, Xiaojin Gong

TL;DR

The paper tackles tuning-free semantic segmentation across supervision types by classifying SAM-derived masks with a frozen CLIP backbone. It introduces discrimination-bias aligned CLIP (DBA-CLIP) to align mask embeddings with text embeddings, addressing the misalignment that limits CLIP's zero-shot classification, and couples this with GLCC, a global-local consistent classifier that is robust to noisy pseudo-labels. GLCC combines a linear probe with inductive label propagation and mutual bootstrapping to exploit the high-quality embeddings while suppressing label noise. Experiments on VOC, COCO, and Cityscapes show state-of-the-art or competitive performance for FSSS, SSS, WSSS, and OVSS without fine-tuning or post-processing.
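
The label propagation idea mentioned in the summary can be illustrated with a minimal sketch. This is the classic iterative scheme on an affinity graph (F <- alpha * S * F + (1 - alpha) * Y), not necessarily the variant used in GLCC; the function names, the cosine affinity, and the values of `alpha` and `iters` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def propagate_labels(embeddings, noisy_scores, alpha=0.8, iters=50):
    """Classic label propagation; an illustrative stand-in, not the paper's exact method.

    embeddings:   list of mask embedding vectors
    noisy_scores: one score vector per mask (e.g. one-hot pseudo-labels)
    Returns the predicted class index for every mask.
    """
    n = len(embeddings)
    k = len(noisy_scores[0])
    # Non-negative cosine affinities without self-loops.
    W = [[max(cosine(embeddings[i], embeddings[j]), 0.0) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    # Row-normalise so every mask averages over its neighbours.
    S = []
    for row in W:
        total = sum(row) or 1.0
        S.append([w / total for w in row])
    F = [list(y) for y in noisy_scores]
    for _ in range(iters):
        # Mix neighbour scores with the original (possibly noisy) labels.
        F = [[alpha * sum(S[i][j] * F[j][c] for j in range(n))
              + (1 - alpha) * noisy_scores[i][c] for c in range(k)]
             for i in range(n)]
    return [max(range(k), key=lambda c: row[c]) for row in F]
```

With well-separated embeddings, a mask whose pseudo-label disagrees with several close neighbours is pulled toward their class, which is the kind of noise suppression the summary attributes to label propagation.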

Abstract

This work presents a tuning-free semantic segmentation framework based on classifying SAM masks by CLIP, which is universally applicable to various types of supervision. Initially, we utilize CLIP's zero-shot classification ability to generate pseudo-labels or perform open-vocabulary segmentation. However, the misalignment between mask and CLIP text embeddings leads to suboptimal results. To address this issue, we propose discrimination-bias aligned CLIP (DBA-CLIP) to closely align mask and text embeddings, offering an overhead-free performance gain. We then construct a global-local consistent classifier (GLCC) to classify SAM masks, which reveals the intrinsic structure of high-quality embeddings produced by DBA-CLIP and demonstrates robustness against noisy pseudo-labels. Extensive experiments validate the efficiency and effectiveness of our method, and we achieve state-of-the-art (SOTA) or competitive performance across various datasets and supervision types.
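
The zero-shot step described in the abstract, classifying a mask by comparing its embedding with text embeddings, can be sketched as follows. This is a simplified illustration under stated assumptions: `features` stands in for a dense per-pixel CLIP feature map, the embeddings are toy vectors, and the helper names `mask_pool`, `normalize`, and `classify_mask` are hypothetical rather than taken from the paper or any library.

```python
import math

def mask_pool(features, mask):
    """Average the per-pixel feature vectors that fall inside a binary mask."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for feat, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for d in range(dim):
                    total[d] += feat[d]
    if count == 0:
        raise ValueError("mask is empty")
    return [t / count for t in total]

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def classify_mask(features, mask, text_embeddings):
    """Pick the class whose text embedding is most cosine-similar to the mask embedding.

    text_embeddings: dict mapping class name -> embedding vector (assumed given).
    Returns (best_class_name, all_scores).
    """
    m = normalize(mask_pool(features, mask))
    scores = {name: sum(a * b for a, b in zip(m, normalize(t)))
              for name, t in text_embeddings.items()}
    return max(scores, key=scores.get), scores
```

The resulting labels can serve as pseudo-labels or as open-vocabulary predictions; the abstract notes that without better mask-text alignment this baseline is suboptimal, which is the gap DBA-CLIP is meant to close.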

Paper Structure (29 sections, 18 equations, 5 figures, 12 tables)


Figures (5)

  • Figure 1: Method overview. The class and confidence of pseudo-labels are indicated by the color and size of the circles. DBA-CLIP is forwarded with awareness of SAM masks. Mask-pooling the DBA-CLIP output produces highly text-aligned mask embeddings, which, along with text embeddings and supervision, are used to construct GLCC. The construction involves alternating between training a linear probe and performing transductive label propagation. During inference, the GLCC classifies masks from test images to produce segmentation results.
  • Figure 2: DBA-CLIP Illustration. (a) We generate attention bias based on masks to constrain affinity computation within the mask. (b) Affinities of the same mask are averaged to produce $\bar{A}_m$, which should align with the discrimination bias on the $L_y$. (c) However, the low-resolution $\bar{A}_m$ fails to match the contour of small and irregular objects, which leads us to adopt an approximate version that avoids directly using $\bar{A}_m$ to weighted-pool MaskCLIP output. See more examples in Figure S1.
  • Figure 7: Performance comparison of the GLCC and its variants on the PASCAL VOC 2012 val set. "LP" indicates linear probe and "IL" indicates inductive label propagation. "$\to$" indicates using one classifier to bootstrap another one.
  • Figure S1: More examples of discrimination-bias alignment. DB tends to overlook misleading areas like text and patterns, ambiguous regions such as faces, and indistinct zones like large solid colors. Instead, it focuses on the most discriminative regions. Average affinity aligns well with DB on large masks but becomes meaningless on small and irregular masks.
  • Figure S2: Illustration of segmentation results generated by the proposed method and its variants.
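
The mask-based attention bias described in the Figure 2(a) caption can be illustrated with a small sketch. It assumes each patch belongs to exactly one mask and uses an additive bias of 0 inside a mask and negative infinity across masks; the caption does not give the exact construction, so the function names and the single-mask-per-patch simplification are assumptions.

```python
import math

NEG_INF = float("-inf")

def mask_attention_bias(patch_mask_ids):
    """Build an additive bias: 0 where two patches share a mask, -inf otherwise."""
    n = len(patch_mask_ids)
    return [[0.0 if patch_mask_ids[i] == patch_mask_ids[j] else NEG_INF
             for j in range(n)] for i in range(n)]

def biased_softmax(logits, bias):
    """Row-wise softmax over logits + bias, so attention stays within each mask."""
    out = []
    for row, bias_row in zip(logits, bias):
        z = [l + b for l, b in zip(row, bias_row)]
        m = max(z)  # finite, because a patch always shares a mask with itself
        e = [math.exp(v - m) for v in z]
        s = sum(e)
        out.append([v / s for v in e])
    return out
```

Patches outside the query patch's mask receive exactly zero weight, so any affinity computed from the biased attention is confined to the mask.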