Open-Vocabulary Segmentation with Semantic-Assisted Calibration

Yong Liu, Sule Bai, Guanbin Li, Yitong Wang, Yansong Tang

TL;DR

This paper studies open-vocabulary segmentation by calibrating the in-vocabulary and domain-biased embedding space with the generalized contextual prior of CLIP. It presents a Semantic-assisted CAlibration Network (SCAN), which achieves state-of-the-art performance on all popular open-vocabulary segmentation benchmarks.

Abstract

This paper studies open-vocabulary segmentation (OVS) by calibrating the in-vocabulary and domain-biased embedding space with the generalized contextual prior of CLIP. As the core of open-vocabulary understanding, the alignment of visual content with the semantics of unbounded text has become the bottleneck of this field. To address this challenge, recent works propose to utilize CLIP as an additional classifier and aggregate model predictions with CLIP classification results. Despite their remarkable progress, the performance of OVS methods in relevant scenarios is still unsatisfactory compared with supervised counterparts. We attribute this to the in-vocabulary embedding and the domain-biased CLIP prediction. To this end, we present a Semantic-assisted CAlibration Network (SCAN). In SCAN, we incorporate the generalized semantic prior of CLIP into the proposal embedding to avoid collapsing onto known categories. In addition, a contextual shift strategy is applied to mitigate the lack of global context and unnatural background noise. With the above designs, SCAN achieves state-of-the-art performance on all popular open-vocabulary segmentation benchmarks. Furthermore, we address a problem of the existing evaluation system, which ignores semantic duplication across categories, and propose a new metric called Semantic-Guided IoU (SG-IoU).

Paper Structure (20 sections, 6 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Illustration of existing two-stage methods and our SCAN. Limited by domain-biased CLIP classification and in-vocabulary model classification, existing methods struggle to align visual content with unbounded text. By incorporating the generalized semantic guidance of CLIP into the proposal embedding and performing contextual shift, our SCAN achieves excellent OVS performance.
  • Figure 2: Pipeline of SCAN. First, a segmentation model is used to generate class-agnostic masks and the corresponding proposal embeddings for cross-modal alignment. To avoid collapse into known categories, the proposal embeddings are calibrated by integrating the global semantic prior of CLIP in the Semantic Integration Module. In addition, the cropped and masked images are input to the Contextual Shifted CLIP for domain-adapted classification. Finally, the matching scores of both the model embeddings and CLIP are combined to assign category labels.
  • Figure 3: Illustration of image domain bias and its detriment to vision-language alignment. The right side shows the classification confidence for masked images. "Ori.CLIP" and "CS.CLIP" denote the original CLIP and our contextual shifted CLIP, respectively.
  • Figure 4: Process of applying contextual shift strategy.
  • Figure 5: Explanation of potential problems in the current evaluation system. Benchmarks contain severe semantic duplication, i.e., synonyms and parent categories, while the current metric does not take the semantic relationships between different categories into account.
  • ...and 1 more figure
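
The last step in the Figure 2 caption, combining the matching scores of the model embeddings and CLIP to assign a category label, can be illustrated with a small sketch. The abstract and captions do not say how the two scores are combined, so the weighted geometric ensemble below is an assumption borrowed from common two-stage practice, not SCAN's actual formulation. The function names, the `weight` and `temperature` parameters, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def combine_scores(model_probs, clip_probs, weight=0.5):
    """Weighted geometric ensemble of two class distributions.

    ASSUMPTION: this fusion rule is illustrative only; the source text
    states that the scores are combined but not how.
    """
    fused = [
        (p ** (1.0 - weight)) * (q ** weight)
        for p, q in zip(model_probs, clip_probs)
    ]
    total = sum(fused)
    return [f / total for f in fused]


def assign_label(proposal_embedding, clip_probs, text_embeddings,
                 class_names, weight=0.5, temperature=0.1):
    """Pick a category for one mask proposal.

    proposal_embedding: embedding produced by the segmentation model.
    clip_probs: class probabilities from CLIP on the masked crop.
    text_embeddings: one text embedding per class name.
    """
    sims = [cosine(proposal_embedding, t) for t in text_embeddings]
    model_probs = softmax(sims, temperature)
    fused = combine_scores(model_probs, clip_probs, weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return class_names[best], fused


if __name__ == "__main__":
    # Toy 2-D embeddings: the proposal embedding leans toward "cat",
    # while the (hypothetical) CLIP prediction leans toward "dog".
    names = ["cat", "dog"]
    text = [[1.0, 0.0], [0.0, 1.0]]
    label, probs = assign_label([0.9, 0.1], [0.2, 0.8], text, names)
    print(label, probs)
```

Setting `weight` to 0 trusts only the model embedding and setting it to 1 trusts only CLIP; intermediate values trade the two off.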