LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation

Can Jin, Ying Li, Mingyu Zhao, Shiyu Zhao, Zhenting Wang, Xiaoxiao He, Ligong Han, Tong Che, Dimitris N. Metaxas

TL;DR

LoR-VP introduces a low-rank visual prompting framework for efficiently adapting pre-trained vision models. It factorizes the prompt as $\mathbf{B} \cdot \mathbf{A}$ with rank $r \ll L$, where $L$ is the side length of the resized image, so the prompt interacts with every image patch while sharing information along rows and columns and using far fewer tunable parameters. Across seven architectures and four datasets, LoR-VP outperforms state-of-the-art visual prompting methods, with up to 6× faster training, 18× fewer prompt parameters, and a 3.1% improvement in performance. The approach is also robust to out-of-distribution data, making it well suited to resource-constrained deployment.
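The factorization can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the per-channel shapes of `B` and `A`, the zero initialization of `B`, the Gaussian initialization of `A`, and the values `L = 224` and `r = 4` are all assumptions chosen for the example.

```python
import numpy as np


def init_low_rank_prompt(L=224, r=4, channels=3, seed=0):
    """Create the two low-rank factors (shapes and init are assumptions)."""
    rng = np.random.default_rng(seed)
    # Assumption: B starts at zero so the initial prompt B @ A is zero.
    B = np.zeros((channels, L, r), dtype=np.float32)
    # Assumption: A starts from small Gaussian noise.
    A = rng.normal(0.0, 0.02, size=(channels, r, L)).astype(np.float32)
    return B, A


def apply_prompt(image, B, A):
    """Add the low-rank prompt B @ A to an already resized (C, L, L) image."""
    prompt = B @ A  # batched matmul over channels, result is (C, L, L)
    return image + prompt


L, r = 224, 4
B, A = init_low_rank_prompt(L, r)
image = np.random.default_rng(1).random((3, L, L), dtype=np.float32)
prompted = apply_prompt(image, B, A)

# Tunable parameters: 2 * channels * L * r
n_params = B.size + A.size
```

Under these assumed settings the two factors hold 5,376 tunable values, whereas a full-size additive prompt of shape 3 × 224 × 224 would hold 150,528. Every pixel in a given row shares the same row of `B`, and every pixel in a given column shares the same column of `A`, which is where the row and column sharing comes from.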

Abstract

Visual prompting has gained popularity as a method for adapting pre-trained models to specific tasks, particularly in the realm of parameter-efficient tuning. However, existing visual prompting techniques often pad the prompt parameters around the image, limiting the interaction between the visual prompts and the original image to a small set of patches while neglecting the inductive bias present in shared information across different patches. In this study, we conduct a thorough preliminary investigation to identify and address these limitations. We propose a novel visual prompt design, introducing Low-Rank matrix multiplication for Visual Prompting (LoR-VP), which enables shared and patch-specific information across rows and columns of image pixels. Extensive experiments across seven network architectures and four datasets demonstrate significant improvements in both performance and efficiency compared to state-of-the-art visual prompting methods, achieving up to 6 times faster training, using 18 times fewer visual prompt parameters, and delivering a 3.1% improvement in performance. The code is available at https://github.com/jincan333/LoR-VP.

Paper Structure

This paper contains 33 sections, 2 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Illustration of various visual prompting methods applied to target domain data: ❶ AutoVP (Pad): Focuses on optimizing the balance between image scaling and tunable parameter integration to enhance model responsiveness. ❷ Patch-Pad: Aims to enhance localized learning by surrounding each image patch with tunable visual prompts. ❸ Patch-Free: Provides maximum adaptability by allowing independent tuning of visual prompts for each patch, catering to diverse feature requirements across the image. ❹ Patch-Same: Promotes consistency in model training by applying uniform visual prompts across all patches, ensuring coherent feature learning across the input.
  • Figure 2: Preliminary Investigation Results. Performance comparison of various VP designs. Our VP method demonstrates competitive or superior performance in several configurations. The final performance of each method is marked by $\bigstar$ or $\bullet$, with all results averaged over three runs.
  • Figure 3: Our VP Design. We resize the image to a resolution of $L\times L$ and initialize two low-rank matrices $\mathbf{B}$ and $\mathbf{A}$ as tunable parameters. $\mathbf{B} \cdot \mathbf{A}$ serves as the visual prompt and is directly added to the resized images. This design allows shared information in rows and columns and also allows patch-specific information in different patches.
  • Figure 4: Performance of ImageNet-1K and CLIP Pre-trained Models on Downstream Datasets. Overview of the performance of LoR-VP compared to four baseline methods. The final performance of each method is indicated by $\bigstar$ or $\bullet$, and all results are averaged over three runs. LoR-VP consistently outperforms all baselines across various models and datasets.
  • Figure 5: Performance of ImageNet-21K Pre-trained Models on ImageNet-1K and Tiny-ImageNet. Performance comparison of LoR-VP and four baseline methods. The models are pre-trained on either ImageNet-21K-P or ImageNet-21K and then tuned on the respective downstream datasets. The final performance results are denoted by $\bigstar$ or $\bullet$. All results are averaged over three runs. LoR-VP consistently outperforms all baselines across different models and datasets.
  • ...and 1 more figure
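The design space described in the captions for Figures 1 and 3 can be made concrete with a rough NumPy sketch. It builds a full-size prompt under three of the designs and counts the tunable parameters of each. The patch size `P = 16`, image size `L = 224`, rank `r = 4`, and the random values are hypothetical, and the Patch-Free and Patch-Same constructions are a simplified reading of the captions rather than the authors' code.

```python
import numpy as np

# Hypothetical settings for illustration only.
C, L, P, r = 3, 224, 16, 4
rng = np.random.default_rng(0)

# Patch-Free (simplified): every pixel of every patch is tuned independently.
patch_free = rng.normal(size=(C, L, L))

# Patch-Same (simplified): one P x P prompt is repeated over all patches.
patch = rng.normal(size=(C, P, P))
patch_same = np.tile(patch, (1, L // P, L // P))

# Low-rank: the prompt is the product of two thin factors.
B = rng.normal(size=(C, L, r))
A = rng.normal(size=(C, r, L))
low_rank = B @ A

param_counts = {
    "patch_free": patch_free.size,  # C * L * L
    "patch_same": patch.size,       # C * P * P
    "low_rank": B.size + A.size,    # 2 * C * L * r
}

# Patch-Same gives identical prompts in every patch; the low-rank prompt
# generally differs from patch to patch while still being rank-limited.
same_across_patches = np.array_equal(
    patch_same[:, :P, :P], patch_same[:, P:2 * P, 2 * P:3 * P]
)
low_rank_differs = not np.allclose(
    low_rank[:, :P, :P], low_rank[:, P:2 * P, 2 * P:3 * P]
)
```

With these assumed sizes the counts are 150,528 for the fully independent prompt, 768 for the shared patch, and 5,376 for the low-rank prompt, which places the low-rank design between full independence and full sharing.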