Diversity Covariance-Aware Prompt Learning for Vision-Language Models

Songlin Dong, Zhengdong Zhou, Chenhao Ding, Xinyuan Gao, Alex Kot, Yihong Gong

TL;DR

The paper addresses the challenge of adapting large vision-language models to few-shot tasks, starting from the observation that feature distributions become non-isotropic when data is limited. It introduces the Diversity Covariance-Aware (DCA) framework, which combines covariance-aware modeling based on the anisotropic Mahalanobis distance with diversity-aware prompts that capture multi-faceted category attributes. The authors derive a theoretically grounded classifier, apply covariance shrinkage for stability, and optimize multiple diverse prompts with text separation to improve generalization. Across 11 datasets and in domain-generalization scenarios, DCA yields substantial performance gains over zero-shot CLIP and contemporary prompt-tuning methods, highlighting the practical value of distribution-aware prompt learning for real-world few-shot applications.

Abstract

Prompt tuning can further enhance the performance of visual-language models across various downstream tasks (e.g., few-shot learning), enabling them to better adapt to specific applications and needs. In this paper, we present a Diversity Covariance-Aware framework that learns distributional information from the data to enhance the few-shot ability of the prompt model. First, we propose a covariance-aware method that models the covariance relationships between visual features and uses anisotropic Mahalanobis distance, instead of the suboptimal cosine distance, to measure the similarity between two modalities. We rigorously derive and prove the validity of this modeling process. Then, we propose the diversity-aware method, which learns multiple diverse soft prompts to capture different attributes of categories and aligns them independently with visual modalities. This method achieves multi-centered covariance modeling, leading to more diverse decision boundaries. Extensive experiments on 11 datasets in various tasks demonstrate the effectiveness of our method.
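
To make the contrast between cosine similarity and Mahalanobis distance concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single shared covariance matrix of the visual features and one text feature per class used as the class mean; the function names and the toy numbers are illustrative assumptions.

```python
import numpy as np

def mahalanobis_logits(visual, text_means, cov):
    """Score = negative squared Mahalanobis distance to each class mean.

    visual:     (N, D) image features
    text_means: (C, D) one text feature per class, treated as the class mean
    cov:        (D, D) covariance of the visual features (assumed invertible)
    """
    precision = np.linalg.inv(cov)
    diff = visual[:, None, :] - text_means[None, :, :]  # (N, C, D)
    dist_sq = np.einsum("ncd,de,nce->nc", diff, precision, diff)
    return -dist_sq

def cosine_logits(visual, text_means):
    """Score = cosine similarity, the isotropic baseline."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text_means / np.linalg.norm(text_means, axis=1, keepdims=True)
    return v @ t.T

# Toy 2-D setup (assumed values): features vary strongly along x, weakly along y.
cov = np.array([[4.0, 0.0],
                [0.0, 0.25]])
text_means = np.array([[1.0, 1.0],   # class 0
                       [3.0, 2.0]])  # class 1
query = np.array([[3.0, 1.1]])       # far from class 0 only along the high-variance axis

pred_mahalanobis = int(np.argmax(mahalanobis_logits(query, text_means, cov)))
pred_cosine = int(np.argmax(cosine_logits(query, text_means)))
print(pred_mahalanobis, pred_cosine)
```

In this toy case the two metrics disagree: the query is closer to class 1 in angle, but its offset from class 0 lies along the direction in which features naturally spread, so the anisotropic metric assigns it to class 0.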

Paper Structure

This paper contains 20 sections, 19 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of feature space distribution and performance comparison. (a) When training data is abundant, a DNN learns a good isotropic, spherical feature space [ncm1], so isotropic metrics (such as cosine or Euclidean distance) can be applied effectively. (b) In few-shot tasks, however, learning an isotropic feature space becomes challenging [kumar2022gdc, goswami2024fecam], and isotropic metrics can lead to misclassified regions (shown in the green area). (c) We propose the CA method, which extracts the distribution information of the data through covariance modeling and uses the anisotropic Mahalanobis distance as the metric. (d) The DA method is designed to combat overfitting and capture different attributes of categories, maintaining more diverse decision boundaries. (e) The DCA method surpasses state-of-the-art methods on 11 diverse datasets.
  • Figure 2: Overview of the architecture of DCA. DCA first uses multiple prompts to describe each class, generating multiple sets of text features through a text encoder. The images are also encoded into a set of visual features. During training, we independently compute the $\mathcal{L}_{cls}$ between multiple sets of text features and visual features. Meanwhile, we model the covariance relationships between visual features and treat the text features as their average vector, enabling the use of Mahalanobis distance to measure the similarity between text and visual feature modalities during testing. Moreover, we introduce $\mathcal{L}_{ts}$ and improve $\hat{\mathcal{L}}_{in}$ to optimize text and visual feature spaces. The detailed training and testing algorithms are provided in the appendix.
  • Figure 3: Comparison results (%) of few-shot learning on 11 datasets. We compare our method with several SOTA prompt learning methods; it shows significant performance improvements on most datasets. More detailed comparison results and the variance of the DCA method are provided in the appendix.
  • Figure 4: (a,b) The influence of text prompt number $M$. (c,d) Sensitivity analysis of hyper-parameters.
  • Figure 5: Impact of the covariance shrinkage parameters $\gamma_1$ and $\gamma_2$.
  • ...and 1 more figures
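
The covariance shrinkage referenced in Figure 5 can be illustrated with a short sketch. This section does not show the paper's shrinkage equation, so the form below is an assumption: the two-parameter shrinkage commonly used in earlier Mahalanobis-classifier work, $\Sigma_s = \Sigma + \gamma_1 V_1 I + \gamma_2 V_2 (1 - I)$, where $V_1$ is the mean diagonal entry and $V_2$ the mean off-diagonal entry of $\Sigma$. The function name and toy features are hypothetical.

```python
import numpy as np

def shrink_covariance(cov, gamma1, gamma2):
    """Two-parameter shrinkage (assumed form, see lead-in).

    Adds gamma1 * (mean diagonal) to the diagonal and
    gamma2 * (mean off-diagonal) to the off-diagonal entries.
    """
    d = cov.shape[0]
    eye = np.eye(d)
    off_mask = 1.0 - eye
    mean_diag = np.mean(np.diag(cov))
    mean_off = np.sum(cov * off_mask) / (d * (d - 1))
    return cov + gamma1 * mean_diag * eye + gamma2 * mean_off * off_mask

# Two samples in three dimensions: fewer samples than dimensions,
# so the sample covariance is rank-deficient and cannot be inverted.
feats = np.array([[1.0, 1.0, 0.0],
                  [-1.0, -1.0, 0.0]])
cov = np.cov(feats, rowvar=False)
shrunk = shrink_covariance(cov, gamma1=1.0, gamma2=1.0)

rank_before = int(np.linalg.matrix_rank(cov))
rank_after = int(np.linalg.matrix_rank(shrunk))
print(rank_before, rank_after)
```

With few shots the number of samples is typically far below the feature dimension, so the raw covariance is singular; shrinkage of this kind is one way to make the inverse needed by the Mahalanobis distance well defined.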