Concept-Guided Prompt Learning for Generalization in Vision-Language Models

Yi Zhang, Ce Zhang, Ke Yu, Yushun Tang, Zhihai He

TL;DR

The paper addresses the limited generalization of CLIP-based fine-tuning on fine-grained and cross-domain tasks. It introduces Concept-Guided Prompt Learning (CPL), which builds a visual concept cache and uses a transformer-based projector to map multi-level visual features into the text space, complemented by a task adapter to preserve pre-trained knowledge while learning task-specific cues. CPL achieves state-of-the-art results across base-to-novel generalization, cross-dataset transfer, and domain generalization, with ablations showing each component materially contributes to performance and efficiency. This approach enables more consistent visual-language alignment by leveraging transferable low-level visual concepts, offering practical improvements for open-vocabulary vision-language tasks. The method is computationally efficient and shows strong generalization across diverse datasets, signaling meaningful impact for real-world VLM deployment.

Abstract

The Contrastive Language-Image Pretraining (CLIP) model has exhibited remarkable efficacy in establishing cross-modal connections between texts and images, yielding impressive performance across a broad spectrum of downstream applications through fine-tuning. However, for generalization tasks, the current fine-tuning methods for CLIP, such as CoOp and CoCoOp, demonstrate relatively low performance on some fine-grained datasets. We identify the underlying reason: these previous methods project only global features into the prompt, neglecting the various visual concepts, such as colors, shapes, and sizes, which are naturally transferable across domains and play a crucial role in generalization tasks. To address this issue, we propose Concept-Guided Prompt Learning (CPL) for vision-language models. Specifically, we leverage the well-learned knowledge of CLIP to create a visual concept cache that enables concept-guided prompting. To refine the text features, we further develop a projector that transforms multi-level visual features into text features. We observe that this concept-guided prompt learning approach achieves enhanced consistency between the visual and linguistic modalities. Extensive experimental results demonstrate that our CPL method significantly improves generalization capabilities compared to the current state-of-the-art methods.
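
The abstract describes the visual concept cache only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each text concept (for example "red" or "round") has a CLIP text embedding, that a pool of image embeddings is available, and that a cache entry is the mean of the k image embeddings most similar to that concept. The function names `build_concept_cache` and `discover_concepts`, the parameters `k` and `m`, and the use of plain NumPy arrays in place of real CLIP features are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def build_concept_cache(text_concept_feats, image_feats, k=2):
    """Hypothetical cache construction (assumed, not taken from the paper).

    text_concept_feats: (C, D) array, one embedding per text concept.
    image_feats:        (N, D) array of image embeddings.
    Returns a (C, D) array whose row c is the normalized mean of the k
    image embeddings most similar to text concept c.
    """
    t = l2_normalize(text_concept_feats)       # (C, D)
    v = l2_normalize(image_feats)              # (N, D)
    sim = t @ v.T                              # (C, N) cosine similarities
    topk = np.argsort(-sim, axis=1)[:, :k]     # (C, k) indices of best images
    cache = v[topk].mean(axis=1)               # (C, D) visual concept entries
    return l2_normalize(cache)


def discover_concepts(query_feat, cache, m=3):
    """Return indices of the m cached concepts closest to one image embedding."""
    q = l2_normalize(query_feat)               # (D,)
    scores = cache @ q                         # (C,) cosine similarities
    return np.argsort(-scores)[:m]


if __name__ == "__main__":
    # Random stand-ins for CLIP features; real usage would encode text and images.
    rng = np.random.default_rng(0)
    concepts = rng.normal(size=(5, 16))
    images = rng.normal(size=(40, 16))
    cache = build_concept_cache(concepts, images, k=3)
    print(discover_concepts(images[0], cache, m=2))
```

In this reading, the indices returned by `discover_concepts` would select which concept words guide the prompt for a given image. The projector and task adapter mentioned in the summary are learnable modules and are not covered by this sketch.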

Paper Structure

This paper contains 34 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Examples and performance comparisons on base-to-novel generalization and cross-dataset transfer tasks. Our proposed CPL exhibits remarkable generalization capabilities in comparison to other state-of-the-art methods.
  • Figure 2: An illustration comparing our proposed CPL approach with related baselines. We include CoOp zhou2022learning, CoCoOp zhou2022conditional, and TaskRes yu2022task for comparison.
  • Figure 3: An overview of our proposed Concept-Guided Prompt Learning (CPL) method. Subfigure (a) shows the visual concept cache-establishing process. Subfigure (b) shows the concept-guided prompt discovery process. Subfigure (c) presents the training pipeline of our proposed CPL, where the projector and task adapter are learnable.
  • Figure 4: Example text concepts collected from existing visual attribute datasets. Here we present several instances of terms that illustrate color, material, size, and shape within our dictionary of text concepts.