Conceptual Codebook Learning for Vision-Language Models

Yi Zhang, Ke Yu, Siqi Wu, Zhihai He

TL;DR

CoCoLe introduces a learnable conceptual codebook that maps visual concepts (keys) to conceptual prompts (values) to bridge image and text representations in vision-language models. By selecting Top-$K_3$ concepts per image and regularizing with a handcrafted concept cache, CoCoLe achieves enhanced cross-domain generalization in few-shot settings, outperforming existing methods across base-to-novel generalization, cross-dataset transfer, and domain generalization. Extensive ablations confirm the importance of each loss term and the chosen hyperparameters, while visualizations demonstrate that prompts capture diverse, high-level concepts. The approach offers a practical, robust avenue for improving VLM generalization with controlled computational overhead.

Abstract

In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs) to address the challenge of improving the generalization capability of VLMs while fine-tuning them on downstream tasks in a few-shot setting. We recognize that visual concepts, such as textures, shapes, and colors, are naturally transferable across domains and play a crucial role in generalization tasks. Motivated by this interesting finding, we learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values, which serves as a link between the image encoder's outputs and the text encoder's inputs. Specifically, for a given image, we leverage the codebook to identify the most relevant conceptual prompts associated with the class embeddings to perform the classification. Additionally, we incorporate a handcrafted concept cache as a regularization to alleviate the overfitting issues in low-shot scenarios. We observe that this conceptual codebook learning method is able to achieve enhanced alignment between visual and linguistic modalities. Extensive experimental results demonstrate that our CoCoLe method remarkably outperforms the existing state-of-the-art methods across various evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization tasks. Detailed ablation studies further confirm the efficacy of each component in CoCoLe.
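
The key-value lookup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the use of cosine similarity, the function names `cosine` and `select_prompts`, and the toy 2-dimensional vectors are all assumptions made for illustration. The sketch only shows how an image feature might select the top-k codebook keys and retrieve their associated prompt values; how the retrieved prompts are combined with class embeddings in the text encoder is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_prompts(image_feature, keys, values, top_k):
    """Return (indices, prompts) for the top_k keys most similar to the image feature.

    keys[i] stands in for a learned visual concept and values[i] for its
    conceptual prompt; both are plain lists of floats in this toy version.
    """
    scores = [cosine(image_feature, key) for key in keys]
    # Rank codebook entries from most to least similar.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]
    return chosen, [values[i] for i in chosen]


# Hypothetical toy codebook: 3 keys in a 2-d feature space with placeholder prompt vectors.
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
image_feature = [0.9, 0.1]

indices, prompts = select_prompts(image_feature, keys, values, top_k=2)
print(indices)  # indices of the two keys closest to the image feature
```

In a trainable version, the keys and values would be learnable parameters and the image feature would come from a frozen image encoder; the hard top-k selection here is kept only to make the retrieval step concrete.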

Paper Structure (30 sections, 6 equations, 5 figures, 6 tables)

This paper contains 30 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: (a) Visualization of the chosen prompts of the same image. (b) Visualization of the same prompts on different images. Grad-CAM (Selvaraju et al., 2017) is used for the visualization.
  • Figure 2: Illustrations and accuracy comparisons on base-to-novel generalization, cross-dataset transfer and domain generalization tasks. S and T represent source and target datasets respectively.
  • Figure 3: An overview of the proposed CoCoLe. (a) shows the establishing process of handcrafted concept cache. (b) displays the handcrafted concept-based prompting process. (c) presents the training pipeline for CoCoLe. Within CoCoLe, only the keys and values in the Conceptual Codebook are learnable.
  • Figure 4: Examples of text concepts from established visual concept datasets, including descriptive terms of texture, color, transparency, motion and brightness.
  • Figure 5: The ablation study on each component of CoCoLe.