A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models

Julio Silva-Rodríguez; Sina Hajimiri; Ismail Ben Ayed; Jose Dolz

TL;DR

This work examines few-shot transfer learning for large vision-language models and shows that state-of-the-art adapters require task-specific hyperparameter tuning and can underperform zero-shot prediction under distribution shift. It introduces CLAP, a class-adaptive constrained linear probe whose balancing term is optimized with an adaptation of the Augmented Lagrangian method, so that it balances preserving the zero-shot prototypes against adapting to the few-shot data without relying on validation data. Across 11 datasets and domain-shift scenarios, CLAP improves consistently over state-of-the-art efficient transfer learning (ETL) approaches while using a minimal number of trainable parameters. The findings suggest that robust, validation-free adaptation is achievable with principled constraint-based optimization, challenging the notion that extensive hyperparameter tuning is necessary for good few-shot transfer performance.

Abstract

Efficient transfer learning (ETL) is receiving increasing attention as a way to adapt large pre-trained vision-language models to downstream tasks with only a few labeled samples. While significant progress has been made, we reveal that state-of-the-art (SoTA) ETL approaches exhibit strong performance only in narrowly defined experimental setups, and only with careful adjustment of hyperparameters based on a large corpus of labeled samples. In particular, we make two interesting and surprising empirical observations. First, to outperform a simple Linear Probing baseline, these methods need their hyperparameters optimized on each target task. Second, they typically underperform (sometimes dramatically) standard zero-shot predictions in the presence of distributional drifts. Motivated by the unrealistic assumptions made in the existing literature, i.e., access to a large validation set and a case-specific grid search for optimal hyperparameters, we propose a novel approach that meets the requirements of real-world scenarios. More concretely, we introduce a CLass-Adaptive linear Probe (CLAP) objective, whose balancing term is optimized via an adaptation of the general Augmented Lagrangian method tailored to this context. We comprehensively evaluate CLAP on a broad span of datasets and scenarios, demonstrating that it consistently outperforms SoTA approaches while being a much more efficient alternative.
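To make the idea of a constrained linear probe more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the objective is a cross-entropy term on the few-shot support set plus a per-class squared L2 penalty that keeps each learned class prototype close to its zero-shot (text-derived) prototype. The function names, the penalty form, the temperature of 100, and the fixed per-class weights `lam` are all assumptions made for illustration; the abstract states that the balancing term is optimized with an adapted Augmented Lagrangian method, and that optimization is not shown here.

```python
import numpy as np

def l2_normalize(a, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    return a / (np.linalg.norm(a, axis=axis, keepdims=True) + eps)

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed with a max-shift for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def constrained_probe_loss(W, T, X, y, lam, temperature=100.0):
    """Hypothetical constrained linear-probe objective.

    W   : (C, D) trainable class prototypes
    T   : (C, D) zero-shot prototypes (assumed to come from text embeddings)
    X   : (N, D) support-set image features
    y   : (N,)   integer class labels
    lam : (C,)   per-class penalty weights (fixed here; an assumption)
    """
    # Cosine-similarity logits between features and class prototypes.
    logits = temperature * l2_normalize(X) @ l2_normalize(W).T
    ce = cross_entropy(logits, y)
    # Per-class squared distance to the zero-shot prototype, weighted by lam.
    penalty = float(np.sum(lam * np.sum((W - T) ** 2, axis=1)))
    return ce + penalty, ce, penalty

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = l2_normalize(rng.normal(size=(3, 8)))   # stand-in zero-shot prototypes
    X = rng.normal(size=(6, 8))                 # stand-in support features
    y = np.array([0, 0, 1, 1, 2, 2])
    lam = np.array([0.5, 0.5, 0.5])
    total, ce, penalty = constrained_probe_loss(T.copy(), T, X, y, lam)
    print(total, ce, penalty)
```

With `W` initialized to `T` (a zero-shot initialized probe), the penalty is zero and the loss reduces to the cross-entropy term; the penalty only grows as training pulls a class prototype away from its zero-shot starting point.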

Paper Structure (44 sections, 15 equations, 8 figures, 11 tables)

Introduction
Related work
Preliminaries
Contrastive vision-language pre-training
Transferability
Zero-shot inference.
Few-shot learning.
Efficient transfer learning with adapters
Pitfalls of existing few-shot ETL methods
Proposed approach
Revisiting Linear Probing
Constrained Linear Probing
Retaining prior knowledge.
Sample and class-specific constraints.
Class Adaptive Constraint for Linear Probing
...and 29 more sections

Figures (8)

Figure 1: Pitfalls of few-shot adapters due to the absence of a model selection strategy. Each entry $(i,j)$ of the cross-shift model selection matrices depicts the relative improvement w.r.t. a zero-shot initialized Linear Probing when the optimal hyperparameters for dataset $i$ (rows) are used to adapt to another task $j$ (columns), for each SoTA method (first three plots) and our approach (last plot).
Figure 2: Pitfalls of few-shot adapters due to the absence of a model selection strategy - Additional methods. Each entry $(i,j)$ of the cross-shift model selection matrices depicts the relative improvement w.r.t. a zero-shot initialized Linear Probing when the optimal hyperparameters for dataset $i$ are used to adapt to another task $j$, for each SoTA method (first four plots) and our approach (last plot). This is an extended version of Figure 1 in the main manuscript.
Figure 3: Linear Probing learning curves. Results of Linear Probing-based methods when adapted to ImageNet using ResNet-50 as a backbone, $16$ shots per class as a support set, and a training scheduler using SGD. During training, both support set accuracy (top) and the performance on the test subset (bottom) are monitored, and the maximum test accuracy is highlighted in the curves. The training scheduler is described in the experimental setup subsection of the main manuscript.
Figure 4: The trade-off between convergence on support set and generalization for zero-shot initialized adapters. We depict the performance on the support and test subsets (after training) of zero-shot initialized Linear Probing adapters. Red numbers indicate the initial learning rate used with the fixed scheduler described in the experimental setup subsection of the main manuscript. Two methods are presented: zero-shot initialized Linear Probe (ZS-LP, top, see "Revisiting Linear Probing"), and class adaptive Linear Probe (CLAP, bottom, see "Class Adaptive Constraint for Linear Probing").
Figure 5: Trade-off between number of shots, trainable parameters, and adaptation performance. The test accuracy is presented with respect to the number of trainable parameters for CLIP-Adapter [gao2021clip], TIP-Adapter(f) [zhang2021tip], and the two proposed solutions in this work: a revisited Linear Probing (ZS-LP, see "Revisiting Linear Probing"), and a class-adaptive Linear Probing (CLAP, see "Class Adaptive Constraint for Linear Probing"). Results were obtained for 1 to 8 shots on the ImageNet dataset.
...and 3 more figures
