Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

Yong Xien Chng, Xuchong Qiu, Yizeng Han, Kai Ding, Wan Ding, Gao Huang

TL;DR

This work tackles open-vocabulary segmentation by identifying mask classification as the main bottleneck and proposing a paradigm shift: freeze the mask generator and focus on enhancing mask classification. The authors introduce Fine-grained Semantic Adaptation (FISA), which pairs Semantic-guided Visual Encoding (SEVE), for injecting fine-grained semantic information into visual features, with Strategic Image-Mask Optimization (SIMO), for keeping the number of trainable parameters small. Across multiple benchmarks, FISA achieves state-of-the-art results, improving PQ by up to 1.0 and mIoU by up to 3.0 while cutting training costs roughly fivefold compared to the previous best methods. The approach preserves the pre-trained knowledge of CLIP while enabling efficient cross-domain adaptation, highlighting the critical role of semantically enriched visual encoding for dense open-vocabulary segmentation.

Abstract

Despite extensive research, open-vocabulary segmentation methods still struggle to generalize across diverse domains. To reduce the computational cost of adapting Vision-Language Models (VLMs) while preserving their pre-trained knowledge, most methods freeze the VLMs for mask classification and train only the mask generator. However, our comprehensive analysis reveals a surprising insight: open-vocabulary segmentation is primarily bottlenecked by mask classification, not mask generation. This discovery prompts us to rethink the existing paradigm and explore an alternative approach. Instead of freezing the VLM, we propose to freeze the pre-trained mask generator and focus on optimizing the mask classifier. Building on the observation that VLMs pre-trained on global-pooled image-text features often fail to capture the fine-grained semantics necessary for effective mask classification, we propose a novel Fine-grained Semantic Adaptation (FISA) method to address this limitation. FISA enhances the extracted visual features with fine-grained semantic awareness by explicitly integrating this crucial semantic information early in the visual encoding process. As our method strategically optimizes only a small portion of the VLM's parameters, it adapts efficiently to new data distributions while largely preserving the VLM's valuable pre-trained knowledge. Extensive ablation studies confirm the superiority of our approach. Notably, FISA achieves new state-of-the-art results across multiple representative benchmarks, improving performance by up to +1.0 PQ and +3.0 mIoU and reducing training costs by nearly 5x compared to previous best methods. Our code and data will be made public.
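
The mask-classification step the abstract refers to can be illustrated with a minimal sketch. It assumes, as is common in CLIP-based pipelines but not spelled out in this summary, that each proposed mask is classified by averaging dense visual features inside the mask and comparing the pooled vector with class-name text embeddings by cosine similarity. The function names (`mask_pool`, `cosine`, `classify_mask`) and the toy 2-dimensional features are hypothetical, and this is not the authors' implementation.

```python
import math


def mask_pool(features, mask):
    """Average the per-pixel feature vectors that fall inside a binary mask."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for row_feats, row_mask in zip(features, mask):
        for vec, inside in zip(row_feats, row_mask):
            if inside:
                count += 1
                for i, value in enumerate(vec):
                    total[i] += value
    if count == 0:
        return total  # empty mask: return the zero vector
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_mask(features, mask, text_embeddings):
    """Return the best-matching class name and all similarity scores."""
    pooled = mask_pool(features, mask)
    scores = {name: cosine(pooled, emb) for name, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores


# Toy 2x2 feature map with 2-dimensional features (stand-in for dense VLM features).
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
top_mask = [[1, 1], [0, 0]]
bottom_mask = [[0, 0], [1, 1]]
# Stand-in for text embeddings of the class names.
text_embeddings = {"sky": [1.0, 0.0], "grass": [0.0, 1.0]}

top_label, _ = classify_mask(features, top_mask, text_embeddings)
bottom_label, _ = classify_mask(features, bottom_mask, text_embeddings)
print(top_label, bottom_label)
```

In this framing, the quality of the label depends entirely on how well the pooled visual features line up with the text embeddings, which is the part of the pipeline FISA targets while the mask proposals stay fixed.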

Paper Structure (19 sections, 3 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: a) Comparison between our proposed Fine-grained Semantic Adaptation (FISA) and the previous open-vocabulary segmentation paradigm. b) Unlike previous methods that focus on improving mask generation, FISA focuses on improving mask classification. Specifically, it adopts a frozen pre-trained mask generator and enhances mask classification through two key innovations: i) Semantic-guided Visual Encoding, which integrates fine-grained semantic information early in the visual encoding process, and ii) Strategic Image-Mask Optimization, which selectively optimizes only a small portion of the VLM's parameters to retain its valuable pre-trained knowledge while giving it the flexibility to adapt to new distributions.
  • Figure 2: a) MaskCLIP shows a much greater performance gain with a perfect "oracle" mask classifier than with a perfect "oracle" mask generator, highlighting mask classification as the main performance bottleneck for open-vocabulary segmentation. b) Using a pre-trained mask generator performs as well as one re-trained from scratch, indicating that the mask generator can be frozen to enhance training efficiency without performance loss.
  • Figure 3: Incorporating fine-grained semantic awareness significantly improves MaskCLIP's performance across many out-of-domain classes in ADE20K. Compared to the baseline MaskCLIP model trained on COCO, the gains reach up to 13.7 points in mIoU. These results highlight the lack of fine-grained semantics as a key factor limiting performance in open-vocabulary segmentation.
  • Figure 4: Overview of Fine-grained Semantic Adaptation (FISA). Guided by the insight that mask classification is the main performance bottleneck and that its weak performance mainly arises from the lack of fine-grained semantics in the extracted visual features, FISA freezes the mask generator and introduces two key innovations for this task. First, it employs Semantic-guided Visual Encoding to inject semantic awareness early into the visual feature extraction process. Second, it utilizes Strategic Image-Mask Optimization to efficiently adapt only a small number of CLIP's parameters to new data distributions while preserving its valuable pre-trained knowledge.
  • Figure 5: Qualitative comparison on open-vocabulary semantic segmentation. Unlike MAFT+, our method accurately identifies buildings with uncommon shapes and textures while avoiding false predictions, such as misclassifying objects as rail.
  • ...and 5 more figures
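
The oracle comparison described in the Figure 2a caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's evaluation protocol: masks are non-overlapping sets of pixel indices, the "oracle classifier" labels each predicted mask with the ground-truth class it overlaps most, and the "oracle generator" feeds ground-truth masks to the real classifier. The helper names (`miou`, `oracle_label`, `evaluate`) and the toy data are hypothetical, and the numbers are contrived rather than results from the paper.

```python
def miou(labelled_masks, gt):
    """Mean IoU over ground-truth classes; each mask is a set of pixel indices."""
    ious = []
    for cls, gt_pixels in gt.items():
        pred_pixels = set()
        for pixels, label in labelled_masks:
            if label == cls:
                pred_pixels |= pixels
        union = pred_pixels | gt_pixels
        ious.append(len(pred_pixels & gt_pixels) / len(union) if union else 0.0)
    return sum(ious) / len(ious)


def oracle_label(pixels, gt):
    """Perfect classifier: pick the ground-truth class with the largest overlap."""
    return max(gt, key=lambda cls: len(pixels & gt[cls]))


def evaluate(pred_masks, classifier, gt):
    """Score the baseline and the two oracle variants on the same image."""
    return {
        "baseline": miou([(m, classifier(m)) for m in pred_masks], gt),
        "oracle_classifier": miou([(m, oracle_label(m, gt)) for m in pred_masks], gt),
        "oracle_generator": miou([(m, classifier(m)) for m in gt.values()], gt),
    }


# Toy 8-pixel image: slightly misaligned predicted masks and a weak classifier.
gt = {"cat": {0, 1, 2, 3}, "dog": {4, 5, 6, 7}}
pred_masks = [{0, 1, 2, 3, 4}, {5, 6, 7}]


def weak_classifier(mask):
    """Deliberately poor stand-in classifier that always answers 'dog'."""
    return "dog"


results = evaluate(pred_masks, weak_classifier, gt)
print(results)
```

In this contrived case, perfect masks do not help because the labels are still wrong, whereas perfect labels recover most of the score even with imperfect masks. That is the kind of gap the caption attributes to mask classification.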