ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models

Yassir Bendou; Amine Ouasfi; Vincent Gripon; Adnane Boukhayma

TL;DR

This work addresses few-shot adaptation of large vision-language models such as CLIP. It reinterprets the caching-based Tip-Adapter as a kernel method and introduces ProKeR, a training-free approach that uses a proximal regularizer in a reproducing kernel Hilbert space (RKHS) to inject global information. The paper derives a closed-form solution for ProKeR, analyzes the local Nadaraya-Watson (NW) and Local Linear Regression (LLR) variants, and shows that global regularization improves robustness and accuracy across 11 datasets, including under distribution shift. A training-based extension, ProKeR+CLAP, further demonstrates the value of joint optimization with a strong base learner. The paper also explores memory-efficient kernels via Mercer decomposition and Random Fourier Features. Extensive experiments establish ProKeR as a new state of the art in training-free few-shot adaptation, with performance competitive with training-based methods. Overall, the global-regression perspective improves stability and generalization while keeping the lightweight, training-free advantages of caching approaches.

Abstract

The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. To enhance CLIP's effectiveness and versatility, efficient few-shot adaptation techniques have been widely adopted. Among these approaches, training-free methods, particularly caching methods exemplified by Tip-Adapter, have gained attention for their lightweight adaptation without the need for additional fine-tuning. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters and are connected to a well-established kernel literature. Drawing on this insight, we offer a theoretical understanding of how these methods operate and suggest multiple avenues for enhancing the Tip-Adapter baseline. Notably, our analysis shows the importance of incorporating global information in local adapters. We therefore propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. Our method, which we call ProKeR (Proximal Kernel ridge Regression), has a closed-form solution and achieves state-of-the-art performance across 11 datasets in the standard few-shot adaptation benchmark.
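To make the contrast between a local cache-style adapter and a global, proximally regularized one more concrete, here is a minimal NumPy sketch. It is not taken from the paper: the Gaussian kernel, the one-hot label encoding, the toy data, and all names (`gaussian_kernel`, `nadaraya_watson`, `proximal_krr`, `base_predict`, `lam`, `beta`) are assumptions made for illustration. The second function uses the standard closed form of kernel ridge regression fitted to the residual of a base predictor, which is one common way to realize a regularizer that pulls the solution toward that base predictor.

```python
import numpy as np

def gaussian_kernel(A, B, beta=1.0):
    # Pairwise Gaussian kernel between rows of A (n, d) and rows of B (m, d).
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-beta * sq_dist)

def nadaraya_watson(x_test, x_shots, y_shots, beta=1.0):
    # Local estimator: kernel-weighted average of the few-shot labels.
    K = gaussian_kernel(x_test, x_shots, beta)
    return (K @ y_shots) / K.sum(axis=1, keepdims=True)

def proximal_krr(x_test, x_shots, y_shots, base_predict, lam=1.0, beta=1.0):
    # Kernel ridge regression on the residual of a base predictor.
    # Large lam keeps the output close to base_predict; small lam fits the shots.
    K = gaussian_kernel(x_shots, x_shots, beta)
    residual = y_shots - base_predict(x_shots)
    alpha = np.linalg.solve(K + lam * np.eye(len(x_shots)), residual)
    return base_predict(x_test) + gaussian_kernel(x_test, x_shots, beta) @ alpha

# Toy data (hypothetical): 6 shots, 3 classes, 8-dimensional unit-norm features.
rng = np.random.default_rng(0)
x_shots = rng.normal(size=(6, 8))
x_shots /= np.linalg.norm(x_shots, axis=1, keepdims=True)
x_test = rng.normal(size=(4, 8))
x_test /= np.linalg.norm(x_test, axis=1, keepdims=True)
y_shots = np.eye(3)[np.array([0, 1, 2, 0, 1, 2])]  # one-hot labels

W = rng.normal(size=(8, 3))  # stand-in for a zero-shot classifier head

def base_predict(X):
    # Softmax over linear logits, standing in for zero-shot predictions.
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

nw_scores = nadaraya_watson(x_test, x_shots, y_shots)
prox_scores = proximal_krr(x_test, x_shots, y_shots, base_predict, lam=0.1)
```

In this sketch, `lam` controls how strongly the adapted predictor is tied to the base predictor: as `lam` grows the output approaches `base_predict`, and as it shrinks the predictor interpolates the few-shot labels.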

Paper Structure (30 sections, 21 equations, 4 figures, 8 tables)

Introduction
Related Work
Vision-Language Pre-trained Models
Few-shot Adaptation
Method
Tip-Adapter as a Nadaraya-Watson estimator
Training-free few-shot adapters as a Bayes optimal mapping
Local Linear Regression
Local methods with a global metric
Proximal Kernel Ridge Regression
Mercer decomposition of kernel methods
Training-based ProKeR
Experiments
Datasets and Evaluation Protocol
Experiment Results and Analysis
...and 15 more sections

Figures (4)

Figure 1: Fitting comparison between different methods on synthetically generated data, illustrating Nadaraya-Watson (Tip-Adapter) bias mitigation via our proposed Local Linear Regression (LLR, see the Local Linear Regression section) and our final method ProKeR.
Figure 2: Overview of our training-free method ProKeR. While Tip-Adapter builds a key-value cache model using the few-shot samples, ProKeR incorporates a proximal global regularization based on the zero-shot predictor in a reproducing kernel Hilbert space (RKHS). This allows the use of a richer model without overfitting on the few-shot data.
Figure 3: Average performance for different methods on 11 image classification datasets.
Figure 4: Few-shot performance of training-free methods on 11 image classification datasets (CoOp's benchmark).
