Low-Rank Few-Shot Adaptation of Vision-Language Models

Maxime Zanella, Ismail Ben Ayed

TL;DR

This paper tackles the challenge of few-shot adaptation for Vision-Language Models by exploring Parameter-Efficient Fine-Tuning through Low-Rank Adaptation (LoRA). It introduces CLIP-LoRA, which applies low-rank updates to both vision and language encoders, using a fixed set of hyper-parameters across 11 datasets. Through extensive ablations, the authors show that LoRA-based fine-tuning can surpass state-of-the-art prompt- and adapter-based methods while reducing training time and memory overhead. The findings suggest that LoRA provides a strong, scalable baseline for fair comparison and progress in the rapidly evolving area of few-shot VLMs, and they offer guidance on where to place LoRA modules and how to choose ranks. Overall, this work highlights LoRA as a practical, competitive alternative for efficient cross-modal fine-tuning with minimal dataset-specific tuning.

Abstract

Recent progress in the few-shot adaptation of Vision-Language Models (VLMs) has further pushed their generalization capabilities, at the expense of just a few labeled samples within the target downstream task. However, this promising, already quite abundant few-shot literature has focused principally on prompt learning and, to a lesser extent, on adapters, overlooking the recent advances in Parameter-Efficient Fine-Tuning (PEFT). Furthermore, existing few-shot learning methods for VLMs often rely on heavy training procedures and/or carefully chosen, task-specific hyper-parameters, which might impede their applicability. In response, we introduce Low-Rank Adaptation (LoRA) in few-shot learning for VLMs, and show its potential on 11 datasets, in comparison to current state-of-the-art prompt- and adapter-based approaches. Surprisingly, our simple CLIP-LoRA method exhibits substantial improvements, while reducing the training times and keeping the same hyper-parameters in all the target tasks, i.e., across all the datasets and numbers of shots. Certainly, our surprising results do not dismiss the potential of prompt-learning and adapter-based research. However, we believe that our strong baseline could be used to evaluate progress in these emergent subjects in few-shot VLMs.
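
The low-rank update that the abstract refers to can be illustrated with a minimal sketch. It follows the original LoRA formulation, in which a frozen weight W is augmented by a trainable product B·A of rank r, scaled by alpha / r. The class name `LoRALinear`, the helper `matvec`, the initialization values and the default `alpha` are illustrative assumptions, not details taken from the CLIP-LoRA implementation.

```python
import random


def matvec(M, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Hypothetical illustration; only A and B would be trained.
    """

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                    # frozen pretrained weight, d_out x d_in
        self.rank = rank
        self.scale = alpha / rank     # assumed scaling, as in the LoRA paper
        # A gets a small random init and B starts at zero, so the
        # update B @ A is exactly zero before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)                      # W x
        low_rank = matvec(self.B, matvec(self.A, x))  # B (A x)
        return [b + self.scale * l for b, l in zip(base, low_rank)]

    def num_trainable(self):
        """Number of trainable parameters: r * (d_in + d_out)."""
        d_out, d_in = len(self.W), len(self.W[0])
        return self.rank * (d_in + d_out)


# Usage: a 4 x 4 identity weight with a rank-2 update.
W = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
layer = LoRALinear(W, rank=2)
out = layer.forward([1, 2, 3, 4])
```

Because B is initialized to zero, the adapted layer reproduces the pretrained output at the start of training, and the trainable parameter count grows with r * (d_in + d_out) rather than d_in * d_out.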

Paper Structure (19 sections, 8 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Different categories of Parameter-Efficient Fine-Tuning (PEFT) methods during (a) training, and (b) inference.
  • Figure 2: Detailed few-shot learning results on the 10 fine-grained datasets and ImageNet with the ViT-B/16 visual backbone. Average performance for the ViT-B/16, ViT-B/32 and ViT-L/14 on the same 11 datasets is reported in the last three plots, respectively.
  • Figure 3: Top-1 accuracy with 4 shots for different matrices of the attention block and increasing rank, when the low-rank matrices are positioned at every level of the encoders (All). The fourth bar plot studies the impact of positioning the low-rank matrices only on the last half of the levels (Up), only on the first half (Bottom), or at every level (All). Reported top-1 accuracy is averaged over 3 random seeds.
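
The three placement options named in the Figure 3 caption (All, Up, Bottom) can be read as a choice of which encoder levels receive low-rank matrices. The sketch below is one possible reading, not the authors' code: the function name `lora_layer_indices`, the zero-based indexing, and the handling of an odd number of levels are all assumptions.

```python
def lora_layer_indices(num_layers, position="all"):
    """Return the zero-based encoder levels that receive low-rank matrices.

    Hypothetical helper: "all" covers every level, "bottom" the first
    half, and "up" the last half. For an odd number of levels the middle
    level is left out of both halves (an arbitrary choice made here).
    """
    half = num_layers // 2
    if position == "all":
        return list(range(num_layers))
    if position == "bottom":
        return list(range(half))
    if position == "up":
        return list(range(num_layers - half, num_layers))
    raise ValueError(f"unknown position: {position!r}")


# Usage: a 12-level encoder, chosen here only as an example depth.
up_levels = lora_layer_indices(12, "up")
bottom_levels = lora_layer_indices(12, "bottom")
all_levels = lora_layer_indices(12, "all")
```

With 12 levels, Up and Bottom each select 6 levels and together cover the same levels as All, so the two half placements use half as many low-rank matrices as the full placement.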