CLIP's Visual Embedding Projector is a Few-shot Cornucopia

Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette

TL;DR

ProLIP adapts CLIP to few-shot tasks by fine-tuning only the vision encoder's embedding projection matrix, with a Frobenius-norm regularizer that pulls it toward the pretrained weights, balancing task adaptation against knowledge preservation. The method achieves state-of-the-art results across 11 few-shot benchmarks and remains robust in validation-free settings, cross-dataset transfer, and test-time adaptation. The Regularized Linear Adapter (RLA) variant extends ProLIP's principles to a black-box setting, and experiments show strong generalization under domain shift and in base-to-new scenarios. The work also provides extensive hyperparameter analyses, demonstrates speed advantages via precomputed features, and suggests a general framework for regularized, parameter-efficient adaptation of multimodal foundation models. Overall, ProLIP offers a practical, fast, and scalable path for deploying CLIP-like models in data-scarce and transfer-heavy contexts, with released code and supplementary material.

Abstract

We introduce ProLIP, a simple and architecture-agnostic method for adapting contrastively pretrained vision-language models, such as CLIP, to few-shot classification. ProLIP fine-tunes the vision encoder's projection matrix with Frobenius norm regularization on its deviation from the pretrained weights. It achieves state-of-the-art performance on 11 few-shot classification benchmarks under both "few-shot validation" and "validation-free" settings. Moreover, by rethinking the non-linear CLIP-Adapter through ProLIP's lens, we design a Regularized Linear Adapter (RLA) that performs better, requires no hyperparameter tuning, is less sensitive to learning rate values, and offers an alternative to ProLIP in black-box scenarios where model weights are inaccessible. Beyond few-shot classification, ProLIP excels in cross-dataset transfer, domain generalization, base-to-new class generalization, and test-time adaptation, where it outperforms prompt tuning while being an order of magnitude faster to train. Code is available at https://github.com/astra-vision/ProLIP.
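To make the objective described in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that logits are scaled cosine similarities between the projected visual feature and frozen text embeddings, and that the regularizer is the squared Frobenius norm of the deviation from the pretrained projection. The function names, the `scale=100.0` value, and the list-of-lists matrix layout are illustrative assumptions.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def l2_normalize(v):
    """Scale v to unit length (left unchanged if it is the zero vector)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def frobenius_sq(W, W0):
    """Squared Frobenius norm of W - W0."""
    return sum((a - b) ** 2 for r, r0 in zip(W, W0) for a, b in zip(r, r0))

def prolip_loss(W, W0, feats, labels, text_emb, lam, scale=100.0):
    """Mean cross-entropy plus lam * ||W - W0||_F^2.

    W        : trainable projection matrix (the only trained weights)
    W0       : pretrained projection matrix, kept fixed
    feats    : pre-projection visual features, one per few-shot sample
    text_emb : frozen, unit-norm text embeddings, one per class
    scale    : assumed logit scale (hypothetical default)
    """
    ce = 0.0
    for x, y in zip(feats, labels):
        v = l2_normalize(matvec(W, x))
        logits = [scale * sum(a * b for a, b in zip(v, t)) for t in text_emb]
        ce += cross_entropy(logits, y)
    return ce / len(feats) + lam * frobenius_sq(W, W0)
```

Because `feats` are the inputs to the projection layer, they can be computed once and cached under this sketch's assumptions, which is consistent with the speed advantage from precomputed features mentioned in the TL;DR.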

Paper Structure (32 sections, 20 equations, 8 figures, 24 tables, 1 algorithm)

Figures (8)

  • Figure 1: Few-shot classification with CLIP. (a) Using a pre-trained CLIP, zero-shot classification is performed by measuring the similarity between text and visual embeddings. Among few-shot adaptation strategies for CLIP, (b) Linear Probing [huang2024lp++, silva2024closer] trains a linear classifier on the visual features, (c) Adapters add external learnable MLPs [gao2024clip, zhang2022tip], and (d) Prompt Tuning learns word embeddings [zhou2022learning, zhou2022conditional, zhu2023prompt, chen2023plot]. Alternatively, we propose (e) ProLIP, which does not introduce new weights and only fine-tunes the visual embedding linear projector.
  • Figure 2: ProLIP for few-shot adaptation. Whether the vision encoder is a CNN or a Transformer, ProLIP trains only the layer that projects the visual embeddings into the shared latent space. The text encoder is frozen, and the text embeddings of the $K$ target concepts are used as classification weights. Training with cross-entropy is regularized by a squared-error loss that keeps the weights of the projection layer close to the pretrained ones.
  • Figure 3: Regularized Linear Adapter (RLA). RLA is a black-box version of ProLIP. Instead of fine-tuning the projection matrix, an external linear adapter is added and trained using cross-entropy loss, with squared-error regularization ensuring that the adapter's weights remain close to the identity matrix.
  • Figure 4: ProLIP sensitivity to hyperparameters. Accuracy of ProLIP as a function of the hyperparameters (learning rate and regularization weight $\lambda$) for $N\in\{1,2,4,8,16\}$-shot settings. Each data point is an average over 11 datasets and 10 seeds.
  • Figure 5: Improving CLIP-Adapter with ProLIP's principles results in the Regularized Linear Adapter (RLA) variant. We report classification accuracy (%) averaged over 11 datasets, 10 seeds, and 4 learning rates $\text{LR} \in \{10^{-5},10^{-4},10^{-3},10^{-2}\}$ for CLIP-Adapter with different $\alpha$ values, $\text{ProLIP}_\varnothing$, and RLA with $\lambda=1/N$. Variance is halved for readability.
  • ...and 3 more figures
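
The RLA idea from the Figure 3 caption can be sketched in the same spirit. This is a hedged illustration, not the released code. It assumes the adapter is a square matrix `A` applied to the frozen model's output visual embedding, that logits are scaled cosine similarities against frozen unit-norm text embeddings, and that the regularization weight is $\lambda = 1/N$ as in the Figure 5 caption. The names `rla_loss` and `identity` and the `scale=100.0` default are hypothetical.

```python
import math

def identity(d):
    """d x d identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

def rla_loss(A, embeds, labels, text_emb, n_shots, scale=100.0):
    """Mean cross-entropy plus (1 / n_shots) * ||A - I||_F^2.

    A       : trainable linear adapter, initialised to the identity
    embeds  : visual embeddings returned by the frozen (black-box) model
    n_shots : number of shots N; lambda = 1 / N is taken from Figure 5
    scale   : assumed logit scale (hypothetical default)
    """
    d = len(A)
    eye = identity(d)
    reg = sum((A[i][j] - eye[i][j]) ** 2 for i in range(d) for j in range(d))
    ce = 0.0
    for v, y in zip(embeds, labels):
        # Apply the adapter, then re-normalise before the cosine similarity.
        z = [sum(a * b for a, b in zip(row, v)) for row in A]
        norm = math.sqrt(sum(c * c for c in z)) or 1.0
        logits = [scale * sum(c * t for c, t in zip(z, emb)) / norm
                  for emb in text_emb]
        # Numerically stable softmax cross-entropy.
        m = max(logits)
        ce += m + math.log(sum(math.exp(l - m) for l in logits)) - logits[y]
    return ce / len(embeds) + (1.0 / n_shots) * reg
```

With `A` set to the identity, the adapted embeddings equal the original ones and the regularizer is zero, so under these assumptions training starts from the zero-shot predictions of the frozen model.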