One Head Eight Arms: Block Matrix based Low Rank Adaptation for CLIP-based Few-Shot Learning

Chunpeng Zhou, Qianqian Shen, Zhi Yu, Jiajun Bu, Haishuai Wang

TL;DR

This paper introduces Block-LoRA, a block matrix-based low-rank adaptation method for fine-tuning CLIP in few-shot learning. By partitioning the LoRA update $\Delta \mathbf{W} = \mathbf{A}\mathbf{B}$ into $n$ blocks and sharing a single down-projection $\mathbf{A_s}$ across them, Block-LoRA reduces both parameter count and computation, which makes it possible to fine-tune CLIP on the ImageNet few-shot benchmark with a single 24GB GPU. The authors also prove that Block-LoRA has a tighter generalization error bound than vanilla LoRA. Experiments across 11 datasets (plus cross-dataset and domain generalization tasks) show competitive performance with lower overhead compared to state-of-the-art CLIP-based few-shot methods. These findings suggest Block-LoRA as an efficient, device-friendly approach for adapting Vision-Language Foundation Models to downstream few-shot tasks, with potential applicability to other multimodal or transfer-learning scenarios.

Abstract

Recent advancements in fine-tuning Vision-Language Foundation Models (VLMs) have garnered significant attention for their effectiveness in downstream few-shot learning tasks. While these recent approaches exhibit some performance improvements, they often suffer from an excessive number of trainable parameters and high computational costs. To address these challenges, we propose a novel block matrix-based low-rank adaptation framework, called Block-LoRA, for fine-tuning VLMs on downstream few-shot tasks. Inspired by recent work on Low-Rank Adaptation (LoRA), Block-LoRA partitions the original low-rank decomposition matrix of LoRA into a series of sub-matrices while sharing all down-projection sub-matrices. This structure not only reduces the number of trainable parameters, but also transforms certain complex matrix multiplication operations into simpler matrix additions, significantly lowering the computational cost of fine-tuning. Notably, Block-LoRA enables fine-tuning CLIP on the ImageNet few-shot benchmark using a single 24GB GPU. We also show that Block-LoRA has a tighter generalization error bound than vanilla LoRA. Without bells and whistles, extensive experiments demonstrate that Block-LoRA achieves competitive performance compared to state-of-the-art CLIP-based few-shot methods, while maintaining a low trainable parameter count and reduced computational overhead.
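
The block structure described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes the convention $\Delta \mathbf{W} = \mathbf{A}\mathbf{B}$ with $\mathbf{A}$ as the down-projection, that the rank $r$ is split evenly into $n$ blocks of rank $r/n$, and that every down-projection block equals one shared matrix $\mathbf{A_s}$. Under those assumptions $\Delta \mathbf{W} = \sum_i \mathbf{A_s}\mathbf{B}_i = \mathbf{A_s}\sum_i \mathbf{B}_i$, so $n$ block products collapse into $n-1$ additions and one product. The sizes `d`, `k`, `r`, `n` and all function names below are hypothetical.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matadd(X, Y):
    """Add two matrices of the same shape elementwise."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def rand_matrix(rows, cols, rng):
    """Small integer matrix, so equality checks below are exact."""
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]


def block_lora_delta(A_s, B_blocks):
    """Delta W = A_s (B_1 + ... + B_n): add the up-projection blocks, then multiply once."""
    B_sum = B_blocks[0]
    for B_i in B_blocks[1:]:
        B_sum = matadd(B_sum, B_i)
    return matmul(A_s, B_sum)


# Hypothetical sizes: input dim d, output dim k, total rank r, number of blocks n.
d, k, r, n = 6, 5, 4, 2
rng = random.Random(0)

A_s = rand_matrix(d, r // n, rng)                            # shared down-projection block
B_blocks = [rand_matrix(r // n, k, rng) for _ in range(n)]   # n up-projection blocks

# Block-LoRA style update: additions plus a single product.
delta_block = block_lora_delta(A_s, B_blocks)

# Reference: the same update written as an ordinary rank-r LoRA product A B,
# where A = [A_s, ..., A_s] (column blocks) and B stacks B_1..B_n (row blocks).
A_full = [row * n for row in A_s]                    # d x r
B_full = [row for B_i in B_blocks for row in B_i]    # r x k
delta_full = matmul(A_full, B_full)

# Trainable parameter counts under the assumptions above.
lora_params = d * r + r * k
block_lora_params = d * (r // n) + r * k

print(delta_block == delta_full, lora_params, block_lora_params)
```

In this sketch the shared block removes $d \cdot r \cdot (1 - 1/n)$ down-projection parameters relative to an unconstrained rank-$r$ LoRA update of the same shape, while the up-projection blocks are left untouched.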

Paper Structure

This paper contains 21 sections, 17 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Architecture comparison of different CLIP-based few-shot learning methods with our Block-LoRA.
  • Figure 2: The detailed structure of our proposed Block-LoRA.
  • Figure 3: Block-LoRA performance comparison in few-shot classification tasks.
  • Figure 4: Comparison of the actual trainable parameter counts.