DCP-CLIP: A Coarse-to-Fine Framework for Open-Vocabulary Semantic Segmentation with Dual Interaction

Jing Wang, Huimin Shi, Quan Zhou, Qibo Liu, Suofei Zhang, Huimin Lu

Abstract

Recent years have witnessed remarkable progress in open-vocabulary semantic segmentation (OVSS) using visual-language foundation models, yet existing methods still suffer from two fundamental challenges: (1) insufficient cross-modal communication between the textual and visual spaces, and (2) significant computational cost from interactions with a massive number of categories. To address these issues, this paper describes a novel coarse-to-fine framework, called DCP-CLIP, for OVSS. Unlike prior efforts that mainly rely on pre-established category content and the inherent spatial-class interaction capability of CLIP, we dynamically construct category-relevant textual features and explicitly model dual interactions between spatial image features and textual class semantics. Specifically, we first leverage CLIP's open-vocabulary recognition capability to identify semantic categories relevant to the image context, upon which we dynamically generate corresponding textual features to serve as initial textual guidance. Subsequently, we perform coarse segmentation by cross-modally integrating semantic information from the textual guidance into the visual representations, and achieve refined segmentation by integrating spatially enriched features from the encoder to recover fine-grained details and enhance spatial resolution. Finally, we leverage spatial information from the segmentation side to refine the category predictions for each mask, facilitating more precise semantic labeling. Experiments on multiple OVSS benchmarks demonstrate that DCP-CLIP outperforms existing methods by delivering both higher accuracy and greater efficiency.
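To make the first step concrete, the following is a minimal sketch of how image-relevant categories might be selected before any dense interaction takes place. It assumes that selection is a top-k ranking by cosine similarity between a global image embedding and per-category text embeddings; the abstract does not specify the actual selection rule. The function name `select_categories`, the `top_k` parameter, and the toy embeddings are hypothetical and stand in for real CLIP outputs.

```python
import numpy as np

def select_categories(image_emb, text_embs, top_k=3):
    """Rank categories by cosine similarity to the image and keep the top_k.

    image_emb: (D,) global image embedding (hypothetical stand-in for CLIP output).
    text_embs: (N, D) one text embedding per candidate category.
    Returns the indices of the kept categories and their similarity scores.
    """
    # Normalize so that the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = text_embs @ image_emb
    # Sort descending and keep only the top_k most relevant categories.
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy example: four candidate categories with orthogonal text embeddings.
categories = ["cat", "dog", "sky", "car"]
text_embs = np.eye(4)
image_emb = np.array([0.9, 0.1, 0.4, 0.0])

idx, scores = select_categories(image_emb, text_embs, top_k=2)
print([categories[i] for i in idx])  # ['cat', 'sky']
```

Restricting later stages to the selected subset is what would reduce the cost of interacting with a large vocabulary: downstream computation scales with `top_k` rather than with the full number of categories.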

Paper Structure (16 sections, 17 equations, 7 figures, 7 tables)

This paper contains 16 sections, 17 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison of efficiency and open-vocabulary semantic segmentation performance among CAT-Seg (blue), SED (purple), and our method (green). The left side shows efficiency metrics, and the right side shows segmentation accuracy. GFLOPs and inference time are measured on a single NVIDIA 3090 GPU. Our method achieves superior segmentation accuracy with balanced efficiency.
  • Figure 2: Illustration of two representative OVSS methods and our approach. (a) Proposal mask-based method. (b) Cost volume-based method. (c) Our DCP-CLIP, which adopts a coarse-to-fine framework and introduces a dual interaction mechanism to explicitly enhance semantic alignment between textual and visual spaces.
  • Figure 3: Overview of our proposed DCP-CLIP. The Dynamic Category Selection module adaptively selects relevant categories based on the image content. Next, we perform cross-modal semantic learning under the guidance of the selected categories, followed by fine segmentation that restores spatial details and resolution through a Spatial Enhanced Decoder. During training, the fine segmentation output is used for supervision. During inference, the Tag Validation module leverages spatial context to refine open-vocabulary predictions.
  • Figure 4: The detailed architecture of text-guided alignment. High-level semantics from text are injected via image features as a bridge, and then fused with initial cost volume features through self-attention to produce more expressive representations.
  • Figure 5: The detailed architecture of the Spatial Enhanced Decoder. Spatial enhancement uses a Swin Transformer to model local context with shallow features and attention maps. Deconvolution and multi-source fusion produce refined semantic maps.
  • ...and 2 more figures
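
The captions for Figures 2 and 4 refer to a cost volume. As a rough illustration only, the sketch below assumes the common construction used by cost volume-based OVSS methods: the cosine similarity between each pixel feature and each category text embedding. Whether DCP-CLIP builds its initial cost volume exactly this way is not stated in the text above. The names `cost_volume` and `coarse_labels` and all array shapes are hypothetical.

```python
import numpy as np

def cost_volume(pixel_feats, text_embs):
    """Cosine similarity between every pixel feature and every text embedding.

    pixel_feats: (H, W, D) dense image features.
    text_embs:   (N, D) embeddings of the N selected categories.
    Returns an (H, W, N) array of similarities.
    """
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    return np.einsum("hwd,nd->hwn", p, t)

def coarse_labels(cost, selected_ids):
    """Assign each pixel the selected category with the highest similarity."""
    return selected_ids[np.argmax(cost, axis=-1)]

# Toy example: a 2x2 feature map and two selected categories
# whose ids in the full vocabulary are 3 and 7.
pixel_feats = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                        [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0]]])
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
selected_ids = np.array([3, 7])

cost = cost_volume(pixel_feats, text_embs)
print(cost.shape)                         # (2, 2, 2)
print(coarse_labels(cost, selected_ids))  # [[3 7]
                                          #  [3 7]]
```

In a full model the raw similarities would be refined by learned layers before any labels are read off; the argmax here only shows what the volume encodes.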