ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models
Yassir Bendou, Amine Ouasfi, Vincent Gripon, Adnane Boukhayma
TL;DR
This work addresses few-shot adaptation of large vision-language models such as CLIP. It reinterprets the caching-based Tip-Adapter as a kernel method and introduces ProKeR, a training-free approach that uses a proximal RKHS regularizer to inject global information. The paper derives a closed-form solution for ProKeR, analyzes local Nadaraya-Watson (NW) and locally linear regression (LLR) variants, and shows that global regularization improves robustness and accuracy across 11 datasets, including under distribution shift. A training-based extension, ProKeR+CLAP, further demonstrates the value of joint optimization with a strong base learner. Memory-efficient kernels based on Mercer decomposition and Random Fourier Features are also explored. Extensive experiments establish ProKeR as a new state of the art in training-free few-shot adaptation, with performance competitive with training-based methods. Overall, the global-regression perspective improves stability and generalization while retaining the lightweight, training-free advantages of caching approaches.
Abstract
The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. To enhance CLIP's effectiveness and versatility, efficient few-shot adaptation techniques have been widely adopted. Among these approaches, training-free methods, particularly caching methods exemplified by Tip-Adapter, have gained attention for their lightweight adaptation without the need for additional fine-tuning. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters and are connected to a well-established kernel literature. Drawing on this insight, we offer a theoretical understanding of how these methods operate and suggest multiple avenues for enhancing the Tip-Adapter baseline. Notably, our analysis shows the importance of incorporating global information in local adapters. We therefore propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. Our method, which we call ProKeR (Proximal Kernel ridge Regression), has a closed-form solution and achieves state-of-the-art performance across 11 datasets in the standard few-shot adaptation benchmark.
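To make the closed-form claim concrete, the following is a minimal sketch of kernel ridge regression regularized toward a base predictor. It assumes one plausible reading of the abstract: the adapted predictor minimizes the squared error on the few-shot labels plus lam times the squared RKHS norm of its deviation from the zero-shot CLIP predictor. Under that assumption, the representer theorem gives f(x) = f_clip(x) + k(x, X)(K + lam I)^{-1}(Y - f_clip(X)). The Gaussian kernel, the hyperparameters lam and beta, and all function names are illustrative choices, not the authors' implementation.

```python
import numpy as np


def gaussian_kernel(A, B, beta=1.0):
    """Gaussian kernel exp(-beta * ||a - b||^2) between rows of A and rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-beta * np.maximum(sq_dist, 0.0))


def base_logits(X, W):
    """Stand-in for the zero-shot base learner: image features times class text features."""
    return X @ W.T


def fit_residual_krr(X_train, Y_train, W, lam=1.0, beta=1.0):
    """Solve (K + lam * I) alpha = Y - f_base(X) for the dual coefficients alpha."""
    K = gaussian_kernel(X_train, X_train, beta)
    residual = Y_train - base_logits(X_train, W)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), residual)


def predict(X_test, X_train, alpha, W, beta=1.0):
    """Base prediction plus a kernel-weighted correction learned from the few shots."""
    return base_logits(X_test, W) + gaussian_kernel(X_test, X_train, beta) @ alpha


def make_toy_problem(n_shots=8, dim=16, n_classes=4, seed=0):
    """Random unit-norm features and one-hot labels (synthetic, for illustration only)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_shots, dim))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    W = rng.normal(size=(n_classes, dim))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    Y = np.eye(n_classes)[rng.integers(0, n_classes, size=n_shots)]
    return X, Y, W
```

In this sketch, lam controls how far the adapted predictor may move away from the base learner: a very large lam returns the zero-shot predictions unchanged, while a very small lam interpolates the few-shot labels.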
