Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation

Kun Ding, Qiang Yu, Haojian Zhang, Gaofeng Meng, Shiming Xiang

TL;DR

Three calibration modules (similarity, weight, and confidence calibration) are presented to close the gap between pre-training and adaptation of vision-language models. A group-based learning strategy reduces the high complexity of Gaussian Processes (GPs), and both training-free and training-required variants are proposed.

Abstract

Cache-based approaches stand out as both effective and efficient for adapting vision-language models (VLMs). Nonetheless, the existing cache model overlooks three crucial aspects. 1) Pre-trained VLMs are mainly optimized for image-text similarity and neglect image-image similarity, which leaves a gap between pre-training and adaptation. 2) The current cache model is based on the Nadaraya-Watson (N-W) estimator, which disregards the relationships among training samples when constructing the weight function. 3) With limited samples, the logits generated by the cache model are highly uncertain, so using them directly without accounting for their confidence could be problematic. This work presents three calibration modules that address these challenges. Similarity Calibration refines the image-image similarity by using unlabeled images: we add a learnable projection layer with a residual connection on top of the pre-trained image encoder of CLIP and optimize its parameters by minimizing a self-supervised contrastive loss. Weight Calibration introduces a precision matrix into the weight function to model the relations between training samples, transforming the existing cache model into a Gaussian Process (GP) regressor, which could be more accurate than the N-W estimator. Confidence Calibration leverages the predictive variances computed by GP regression to dynamically re-scale the logits of the cache model, so that its outputs are adjusted according to their confidence levels. In addition, to reduce the high complexity of GPs, we propose a group-based learning strategy. Integrating the above designs, we propose both training-free and training-required variants. Extensive experiments on 11 few-shot classification datasets validate that the proposed methods can achieve state-of-the-art performance.
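To make the contrast between the N-W estimator and the GP regressor concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a Gaussian kernel on feature vectors, one-hot labels as regression targets, and the textbook GP regression formulas for the predictive mean and variance. The function names, `beta`, `noise_var`, and in particular the re-scaling rule in `rescale_by_confidence` are illustrative assumptions; the abstract does not state the exact re-scaling formula.

```python
import numpy as np


def gaussian_kernel(a, b, beta):
    """k(a, b) = exp(-beta * ||a - b||^2) for every pair of rows in a and b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-beta * np.maximum(sq_dist, 0.0))


def nw_logits(train_x, train_y, test_x, beta):
    """Nadaraya-Watson estimate: each weight depends only on the
    test-to-train kernel value, never on train-to-train relations."""
    k_star = gaussian_kernel(test_x, train_x, beta)
    weights = k_star / k_star.sum(axis=1, keepdims=True)
    return weights @ train_y


def gp_predict(train_x, train_y, test_x, beta, noise_var):
    """Standard GP regression: the precision matrix (K + noise_var * I)^{-1}
    brings train-to-train relations into the weight function."""
    k_train = gaussian_kernel(train_x, train_x, beta)
    k_star = gaussian_kernel(test_x, train_x, beta)
    gram = k_train + noise_var * np.eye(len(train_x))
    # weights = k_star @ gram^{-1}; gram is symmetric, so solve and transpose.
    weights = np.linalg.solve(gram, k_star.T).T
    mean = weights @ train_y
    # k(x, x) = 1 for the Gaussian kernel.
    var = 1.0 - np.sum(weights * k_star, axis=1)
    return mean, var


def rescale_by_confidence(mean, var):
    """Hypothetical re-scaling (an assumption, not the paper's formula):
    shrink the logits of test points with a large predictive variance."""
    return mean / np.sqrt(var)[:, None]
```

With a very large `noise_var`, the GP weights become proportional to the raw kernel values, so the GP prediction ranks classes the same way as an unnormalized kernel-weighted sum over the cache. Smaller values let the precision matrix down-weight redundant training samples.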

Paper Structure

This paper contains 19 sections, 21 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Feature visualization on the test set of the EuroSAT dataset without (a) and with (b) similarity calibration.
  • Figure 2: Visualization of the weight function. The Gaussian kernel $\exp(-\beta x^2)$ with $\beta=100$ is adopted. For the GP, 200 uniformly spaced 1D data points in $[0,1]$ are used. The noise variance of the GP in (a) and (b) is $\sigma^2=1.0$ and $\sigma^2=100.0$, respectively.
  • Figure 3: Overview of the proposed adaptation method GPCache for vision-language models (VLMs). The key ingredients of the proposed method are similarity calibration, confidence calibration and weight calibration.
  • Figure 4: Illustration of computing the contrastive loss in the similarity calibration stage. $\mathbf{f}_1,\cdots, \mathbf{f}_b$ are the features extracted from the original images, and $\mathbf{g}_1,\cdots, \mathbf{g}_b$ are the features extracted from the augmented images.
  • Figure 5: Comparison of training-free methods under the few-shot classification setting on the 11 datasets.
  • ...and 3 more figures
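
The similarity calibration step illustrated in Figure 4 can be sketched as follows. This is a forward-pass illustration only, under stated assumptions: the source says "self-supervised contrastive loss" without naming a specific form, so an InfoNCE-style loss is swapped in here, with $(\mathbf{f}_i, \mathbf{g}_i)$ treated as the positive pair and all other pairs in the batch as negatives. The names `project` and `info_nce`, the single linear `weight` matrix, and the `temperature` value are hypothetical.

```python
import numpy as np


def project(features, weight):
    """Residual linear projection on top of frozen encoder features,
    followed by L2 normalisation. A zero `weight` leaves the (normalised)
    pre-trained features unchanged."""
    out = features + features @ weight.T
    return out / np.linalg.norm(out, axis=1, keepdims=True)


def info_nce(f, g, temperature=0.1):
    """InfoNCE-style loss over a batch of b pairs (an assumed loss form).

    f: (b, d) projected features of the original images.
    g: (b, d) projected features of the augmented images.
    """
    logits = (f @ g.T) / temperature
    # Subtract the row-wise maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positives sit on the diagonal: f_i is paired with g_i.
    return -np.mean(np.diag(log_prob))
```

In a training loop, only `weight` would be updated by a gradient-based optimizer while the image encoder stays frozen; an autodiff framework would replace the plain NumPy arrays used here.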