CLIP's Visual Embedding Projector is a Few-shot Cornucopia
Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette
TL;DR
ProLIP adapts CLIP to few-shot tasks by fine-tuning only the vision embedding projection matrix, with a Frobenius-norm regularizer that pulls it toward the pretrained weights and so balances task adaptation against knowledge preservation. The method achieves state-of-the-art results on 11 few-shot benchmarks and remains robust in validation-free settings, cross-dataset transfer, and test-time adaptation. A Regularized Linear Adapter (RLA) variant carries the same principle over to black-box settings, and experiments show strong generalization under domain shift and in base-to-new scenarios. The work also provides extensive hyperparameter analyses, shows that training is fast when features are precomputed, and suggests a general framework for regularized, parameter-efficient adaptation of multimodal foundation models. Overall, ProLIP offers a practical, fast, and scalable path for deploying CLIP-like models in data-scarce and transfer-heavy contexts, with code and supplementary material available.
Abstract
We introduce ProLIP, a simple and architecture-agnostic method for adapting contrastively pretrained vision-language models, such as CLIP, to few-shot classification. ProLIP fine-tunes the vision encoder's projection matrix with Frobenius norm regularization on its deviation from the pretrained weights. It achieves state-of-the-art performance on 11 few-shot classification benchmarks under both "few-shot validation" and "validation-free" settings. Moreover, by rethinking the non-linear CLIP-Adapter through ProLIP's lens, we design a Regularized Linear Adapter (RLA) that performs better, requires no hyperparameter tuning, is less sensitive to learning rate values, and offers an alternative to ProLIP in black-box scenarios where model weights are inaccessible. Beyond few-shot classification, ProLIP excels in cross-dataset transfer, domain generalization, base-to-new class generalization, and test-time adaptation, where it outperforms prompt tuning while being an order of magnitude faster to train. Code is available at https://github.com/astra-vision/ProLIP.
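To make the objective described above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the pre-projection visual features and the text embeddings of the class prompts are precomputed, and that the regularizer is the squared Frobenius norm of the deviation from the pretrained projection. The function name `prolip_loss`, the weight `lam`, and the logit scale of 100 are illustrative assumptions.

```python
import numpy as np

def prolip_loss(feats, labels, W, W0, text_emb, lam=1.0, scale=100.0):
    """Cross-entropy on CLIP-style logits plus a penalty on ||W - W0||_F^2.

    feats:    (N, d_in)  pre-projection visual features (assumed precomputed)
    labels:   (N,)       integer class labels
    W, W0:    (d_in, d_out) trainable and pretrained projection matrices
    text_emb: (K, d_out) text embeddings of the K class prompts
    """
    # Project visual features and L2-normalize both modalities.
    z = feats @ W
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Scaled cosine similarities act as classification logits.
    logits = scale * (z @ t.T)

    # Numerically stable log-softmax followed by mean cross-entropy.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()

    # Assumed regularizer: squared Frobenius norm of the deviation from W0.
    reg = lam * np.sum((W - W0) ** 2)
    return ce + reg, ce, reg

# Toy usage with random data (shapes are arbitrary, for illustration only).
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
labels = rng.integers(0, 4, size=8)
W0 = rng.normal(size=(16, 6))
text_emb = rng.normal(size=(4, 6))
total, ce, reg = prolip_loss(feats, labels, W0.copy(), W0, text_emb)
```

At initialization (`W` equal to `W0`) the penalty is zero and the loss reduces to plain cross-entropy on the zero-shot logits. As `W` moves away from `W0` the penalty grows, which is the mechanism that trades task adaptation against retention of pretrained knowledge.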
