A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation

Zhengbo Wang, Jian Liang, Lijun Sheng, Ran He, Zilei Wang, Tieniu Tan

TL;DR

The paper tackles the resource-intensive nature of adapting CLIP to downstream tasks by introducing a training-free adaptation based on Gaussian Discriminant Analysis (GDA) of CLIP features. By modeling each class as Gaussian with a shared covariance, Bayes' rule yields a linear classifier whose parameters are estimated directly from data, and this GDA classifier is ensembled with CLIP's zero-shot weights to leverage both modalities. It further extends the approach to base-to-new generalization via KNN-based data synthesis and to unsupervised learning via EM for Gaussian mixtures, achieving competitive or superior results compared to training-based methods across 17 datasets. The method demonstrates strong few-shot, imbalanced, and out-of-distribution performance, and the authors provide code to enable reproducibility and broader adoption.

Abstract

Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity. Recent research has focused on developing efficient fine-tuning methods, such as prompt learning and adapters, to enhance CLIP's performance in downstream tasks. However, these methods still require additional training time and computational resources, which is undesirable for devices with limited resources. In this paper, we revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP. Typically, GDA assumes that features of each class follow Gaussian distributions with identical covariance. By leveraging Bayes' formula, the classifier can be expressed in terms of the class means and covariance, which can be estimated from the data without the need for training. To integrate knowledge from both visual and textual modalities, we ensemble it with the original zero-shot classifier within CLIP. Extensive results on 17 datasets validate that our method surpasses or achieves comparable results with state-of-the-art methods on few-shot classification, imbalanced learning, and out-of-distribution generalization. In addition, we extend our method to base-to-new generalization and unsupervised learning, once again demonstrating its superiority over competing approaches. Our code is publicly available at https://github.com/mrflogs/ICLR24.

Paper Structure (24 sections, 1 theorem, 11 equations, 5 figures, 25 tables, 1 algorithm)

Key Result

Theorem A.1

Assume that the features of different classes follow Gaussian distributions with identical covariance, i.e., $(X|Y=i)\sim \mathcal{N}(\mu_i, \Sigma)$ for $i=1,2,\dots,K$. Then the classification probability can be expressed in closed form in terms of the class means $\mu_i$ and the shared covariance $\Sigma$.
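
For reference, the textbook result under this shared-covariance assumption is a softmax over linear scores. The sketch below writes $\pi_i = p(Y=i)$ for the class prior and $w_i$, $b_i$ for the classifier weight and bias; this notation is an assumption made here and may differ from the paper's exact statement.

$$
p(Y=i \mid X=x) = \frac{\exp\left(w_i^\top x + b_i\right)}{\sum_{j=1}^{K}\exp\left(w_j^\top x + b_j\right)},
\qquad
w_i = \Sigma^{-1}\mu_i,
\qquad
b_i = \log \pi_i - \tfrac{1}{2}\,\mu_i^\top \Sigma^{-1}\mu_i .
$$

The classifier is linear because the quadratic term $-\tfrac{1}{2}x^\top \Sigma^{-1} x$ from the Gaussian density is identical for every class and cancels between numerator and denominator when Bayes' rule is applied.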

Figures (5)

  • Figure 1: Overview of our training-free method. We first extract visual features from the training dataset using the CLIP visual encoder. Next, we compute the mean vector of each class and the shared precision matrix (inverse covariance) using Eq. (\ref{eq:precisionmatrix}). Through Gaussian Discriminant Analysis (GDA), the weight and bias of the classifier are expressed in terms of the mean vectors and the precision matrix, as given by Eq. (\ref{eq:solution}) (the red formula in the figure). Finally, we ensemble the GDA classifier with CLIP's zero-shot classifier to integrate knowledge from the visual and textual modalities.
  • Figure 2: Results of few-shot classification on the 11 datasets. We evaluate the performance of our proposed method against five training-free methods under 1, 2, 4, 8, and 16-shot settings. The models are trained using ResNet-50 CLIP. Our method outperforms the baselines significantly.
  • Figure 2: Out-of-distribution Generalization.
  • Figure 3: We trained our method on ImageNet with more shots. The x-axis is presented on a logarithmic scale.
  • Figure 4: Average results over 11 datasets on base-to-new generalization.
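
The pipeline described in the Figure 1 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' released implementation: the function names `gda_classifier` and `ensemble_logits`, the `ridge` term used to keep the covariance invertible, the uniform class prior, and the mixing weight `alpha` are all hypothetical choices made here. The paper's own precision-matrix estimator and ensembling coefficients may differ.

```python
import numpy as np


def gda_classifier(features, labels, num_classes, ridge=1e-3):
    """Closed-form GDA weights and biases.

    features: (N, D) array of image features (e.g., from a CLIP visual encoder).
    labels:   (N,) integer array with values in [0, num_classes).
    Every class is assumed to have at least one sample.
    """
    dim = features.shape[1]
    # Per-class mean vectors, shape (K, D).
    means = np.stack(
        [features[labels == k].mean(axis=0) for k in range(num_classes)]
    )
    # Shared covariance: center each sample on its own class mean.
    centered = features - means[labels]
    cov = centered.T @ centered / features.shape[0]
    # ASSUMPTION: a simple ridge term keeps the inverse well-conditioned
    # in the few-shot regime; the paper may use a different estimator.
    precision = np.linalg.inv(cov + ridge * np.eye(dim))
    # w_k = precision @ mu_k (precision is symmetric), stacked as rows.
    weights = means @ precision
    # ASSUMPTION: uniform class prior, so log(prior) is the same for all k.
    log_prior = np.log(np.full(num_classes, 1.0 / num_classes))
    biases = log_prior - 0.5 * np.einsum("kd,kd->k", weights, means)
    return weights, biases


def ensemble_logits(features, text_weights, weights, biases, alpha=1.0):
    """Add zero-shot logits (features @ text embeddings) to the GDA logits."""
    zero_shot = features @ text_weights.T
    gda = features @ weights.T + biases
    return zero_shot + alpha * gda


# Toy usage with synthetic, well-separated "features" for two classes.
rng = np.random.default_rng(0)
toy_x = np.concatenate(
    [rng.normal(3.0, 1.0, (20, 4)), rng.normal(-3.0, 1.0, (20, 4))]
)
toy_y = np.repeat([0, 1], 20)
toy_w, toy_b = gda_classifier(toy_x, toy_y, num_classes=2)
# Hypothetical stand-in for text embeddings: unit vectors along each class direction.
toy_text = np.array([[0.5, 0.5, 0.5, 0.5], [-0.5, -0.5, -0.5, -0.5]])
toy_pred = ensemble_logits(toy_x, toy_text, toy_w, toy_b).argmax(axis=1)
```

No gradient step appears anywhere: the classifier is obtained from one pass over the features plus a single matrix inverse, which is what makes the approach training-free.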

Theorems & Definitions (2)

  • Theorem A.1
  • proof