TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

Jingyao Li, Pengguang Chen, Shengju Qian, Shu Liu, Jiaya Jia

TL;DR

This work introduces a trusty token that lets CLIP distinguish novel classes from known ones at prediction time, and shows that TagCLIP improves the Intersection over Union (IoU) of unseen classes by 7.4%, 1.7% and 2.1% on PASCAL VOC 2012, COCO-Stuff 164K and PASCAL Context, respectively, with negligible overhead.

Abstract

Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks. However, existing approaches that use CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes, leading to confusion between novel classes and semantically similar ones. In this work, we propose a novel approach, TagCLIP (Trusty-aware guided CLIP), to address this issue. We disentangle the ill-posed optimization problem into two parallel processes: semantic matching performed individually and reliability judgment for improving discrimination ability. Building on the idea of special tokens in language modeling representing sentence-level embeddings, we introduce a trusty token that enables distinguishing novel classes from known ones in prediction. To evaluate our approach, we conduct experiments on three benchmark datasets: PASCAL VOC 2012, COCO-Stuff 164K and PASCAL Context. Our results show that TagCLIP improves the Intersection over Union (IoU) of unseen classes by 7.4%, 1.7% and 2.1%, respectively, with negligible overhead. The code is available at https://github.com/dvlab-research/TagCLIP.
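
The headline numbers above are IoU gains on unseen classes. As a rough illustration of how such a figure is typically computed, the sketch below evaluates per-class IoU on a toy label map and averages it over a chosen set of class ids. The function names, the `ignore_index=255` convention, and the toy data are assumptions made for this example; they are not taken from the TagCLIP code base.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes, ignore_index=255):
    """Return one IoU per class; NaN for classes absent from both pred and gt."""
    valid = gt != ignore_index          # drop pixels marked as "ignore" (assumed convention)
    pred, gt = pred[valid], gt[valid]
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious[c] = inter / union
    return ious

def mean_iou(ious, class_ids):
    """Average IoU over a subset of classes, e.g. the unseen ones."""
    return float(np.nanmean(ious[list(class_ids)]))

# Toy 2x3 label maps with three classes; class 2 plays the role of an unseen class.
gt = np.array([[0, 0, 1], [2, 2, 1]])
pred = np.array([[0, 1, 1], [2, 0, 1]])
ious = per_class_iou(pred, gt, num_classes=3)
unseen_miou = mean_iou(ious, class_ids=[2])
```

Averaging over only the unseen class ids, rather than all classes, is what separates the unseen-class IoU quoted here from an overall mIoU.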

Paper Structure (30 sections, 15 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Visualization of segmentation results on COCO-Stuff 164K. The four columns, from left to right, show (a) original testing images; (b) results of the current SOTA, ZegCLIP; (c) results of TagCLIP; (d) ground truth. Tags with black and red borders denote seen and unseen classes, respectively.
  • Figure 2: Left: The framework of our TagCLIP. First, we input images and text prompts into CLIP and concatenate a learnable trusty token with CLIP's text tokens. Then, we match the concatenated tokens with image tokens and feed the result into our proposed Trusty Learner. Next, we apply a segmentor (SegViT) to generate two maps: the trusty map and the raw mask. $\otimes$ and $\oplus$ represent Hadamard product and concatenation. Right: The detailed structure of the Trusty Learner and the segmentor. The Trusty Learner contains a linear projection, a multi-head attention block with a shortcut, and a normalization layer. The segmentor contains three layers. Each layer consists of an Attention-to-Mask block (SegViT) and a linear projection, both with shortcuts and normalization layers.
  • Figure 3: Left: During training, we propose a binary mask $\mathbf{G}_A$ for the supervision of $\mathbf{M}_A$ and utilize the ground truth $\mathbf{G}_R$ for the supervision of $\mathbf{M}_R$. Right: During inference, the raw semantic segmentation $\mathbf{M}_R$ is weighted by $\mathbf{M}_A$ to generate the final mask $\mathbf{M}$.
  • Figure 4: Visualization of the trusty map. It focuses on seen classes (left two columns), such as airplane, bicycle, boat, car, cow, motorbike, dining table and bottle. At the same time, it effectively disregards unseen categories (right column), such as potted plant, sheep, tv monitor and sofa.
  • Figure 5: Examples of two Trusty Learner structures. Left: concatenates $\mathbf{T'}\odot \mathbf{H}$ and $\mathbf{T'}$ before inputting them into the multi-head attention. Right: first inputs $\mathbf{T'}\odot \mathbf{H}$ into the multi-head attention and then concatenates the output with $\mathbf{T'}$. $\oplus$: dimension concatenation. $\odot$: per-element Hadamard product. $\mathbf{\hat{T'}}$: defined in the paper's equation labeled equ:hatt'.
  • ...and 3 more figures
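
The inference step described in the Figure 3 caption, where the raw mask $\mathbf{M}_R$ is weighted by the trusty map $\mathbf{M}_A$, can be illustrated with a small NumPy sketch. The caption does not give the exact weighting rule, so the rule used here is an assumption for illustration only: seen-class scores are scaled by $\mathbf{M}_A$ and unseen-class scores by $1 - \mathbf{M}_A$, in line with the Figure 4 caption's description of the trusty map as emphasizing seen classes. The function name `fuse_masks` and the toy values are hypothetical.

```python
import numpy as np

def fuse_masks(raw_mask, trusty_map, seen_ids, unseen_ids):
    """Weight per-class scores M_R (C, H, W) by a trusty map M_A (H, W) in [0, 1].

    Assumed rule (not confirmed by the captions): seen classes are scaled by M_A,
    unseen classes by 1 - M_A.
    """
    fused = np.empty_like(raw_mask, dtype=float)
    fused[seen_ids] = raw_mask[seen_ids] * trusty_map
    fused[unseen_ids] = raw_mask[unseen_ids] * (1.0 - trusty_map)
    return fused

# Toy example: 2 classes on a 1x2 image. The raw scores favor the seen class everywhere.
raw = np.array([
    [[0.6, 0.6]],   # class 0 (seen)
    [[0.4, 0.4]],   # class 1 (unseen)
])
trusty = np.array([[0.9, 0.1]])  # high = pixel likely belongs to a seen class

fused = fuse_masks(raw, trusty, seen_ids=[0], unseen_ids=[1])
labels = fused.argmax(axis=0)    # per-pixel class after weighting
```

In this toy case the raw scores alone would assign both pixels to the seen class; after weighting, the pixel with a low trusty value switches to the unseen class, which is the kind of seen/unseen disambiguation the trusty map is meant to provide.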