Generalizable Semantic Vision Query Generation for Zero-shot Panoptic and Semantic Segmentation

Jialei Chen, Daisuke Deguchi, Chenkai Zhang, Hiroshi Murase

TL;DR

The paper tackles zero-shot panoptic segmentation (ZPS) with CONCAT, a two-stage framework that combines projection-based semantic alignment with generation-based pseudo-query synthesis. Conditional Token Alignment (CON) links vision and semantics by aligning semantic queries with CLIP CLS tokens from both full and masked images. Cycle Transition (CAT) then trains a generator of pseudo queries for unseen categories in both semantic-vision and vision-semantic directions, using CVAE, GMMN, and Query Contrast. Union-finetuning finally adapts the semantic projector to unseen categories by combining real and pseudo queries. Across COCO and open-vocabulary benchmarks, CONCAT improves ZPS by 5.2% hPQ over the previous state of the art, and it obtains comparable results on inductive ZPS and open-vocabulary semantic segmentation while testing 2 times faster.

Abstract

Zero-shot Panoptic Segmentation (ZPS) aims to recognize foreground instances and background stuff without training on any images that contain unseen categories. The task remains challenging because visual data are sparse and generalizing from seen to unseen categories is difficult. To better generalize to unseen classes, we propose Conditional tOken aligNment and Cycle trAnsiTion (CONCAT) to produce generalizable semantic vision queries. First, a feature extractor is trained with CON to link vision and semantics and to provide target queries. Specifically, CON aligns the semantic queries with the CLIP visual CLS tokens extracted from complete and masked images. To address the absence of unseen categories in training, a generator is required. However, one of the gaps in synthesizing pseudo vision queries, i.e., vision queries for unseen categories, is describing fine-grained visual details through semantic embeddings. Therefore, we propose CAT to train the generator in both semantic-vision and vision-semantic directions. In the semantic-vision direction, visual query contrast models the high granularity of vision by pulling the pseudo vision queries toward the corresponding targets that contain segments while pushing them away from those without segments. In the vision-semantic direction, to ensure that the generated queries retain semantic information, the pseudo vision queries are mapped back to the semantic space and supervised by real semantic embeddings. Experiments on ZPS show a 5.2% hPQ increase over the previous SOTA. We also examine inductive ZPS and open-vocabulary semantic segmentation and obtain comparable results while being 2 times faster in testing.
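
The abstract describes visual query contrast only in words: pseudo vision queries are pulled toward targets that contain segments and pushed away from those that do not. The following is a minimal sketch of that idea, assuming an InfoNCE-style objective over cosine similarities. The exact loss used by the paper is not given in this section, and the names `query_contrast_loss`, `has_segment`, and `temperature` are hypothetical, chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def query_contrast_loss(pseudo_query, target_queries, has_segment, temperature=0.1):
    """InfoNCE-style contrast (an assumed form, not the paper's equation).

    pseudo_query:   generated vision query for one category.
    target_queries: real vision queries produced by the feature extractor.
    has_segment:    one bool per target; True if the target contains a
                    segment (positive), False otherwise (negative).
    """
    logits = [cosine(pseudo_query, t) / temperature for t in target_queries]
    positives = [l for l, keep in zip(logits, has_segment) if keep]
    if not positives:
        raise ValueError("need at least one target query with a segment")
    # Numerically stable log-sum-exp over all targets and over positives.
    m = max(logits)
    log_all = m + math.log(sum(math.exp(l - m) for l in logits))
    log_pos = m + math.log(sum(math.exp(l - m) for l in positives))
    # Equals -log(sum over positives / sum over all targets); always >= 0.
    return log_all - log_pos


if __name__ == "__main__":
    pseudo = [1.0, 0.0]
    targets = [[0.9, 0.1], [0.0, 1.0]]
    # The loss is small when the similar target is the one with a segment.
    print(query_contrast_loss(pseudo, targets, [True, False]))
    print(query_contrast_loss(pseudo, targets, [False, True]))
```

Under this assumed form, the loss shrinks as the pseudo query moves closer to the targets with segments relative to those without, which matches the pull and push behavior described above.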

Paper Structure

This paper contains 14 sections, 15 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Generation-based methods.
  • Figure 2: Projection-based methods.
  • Figure 4: Training process.
  • Figure 5: CON overview.
  • Figure 6: CAT overview, where $y_{\hat{\sigma}(\mathbf{V}^{\varnothing})} = \varnothing$.
  • ...and 3 more figures