Do We Really Need a Large Number of Visual Prompts?

Youngeun Kim, Yuhang Li, Abhishek Moitra, Ruokai Yin, Priyadarshini Panda

TL;DR

This paper investigates how the number of visual prompt tokens in Visual Prompt Tuning (VPT) affects fine-tuning accuracy and the self-attention operation in Vision Transformers. It shows that accuracy does not improve linearly as prompts are added and that self-attention remains effectively low-rank, with its rank growing only logarithmically in the number of tokens. To counter the accuracy drop that comes with using fewer prompts, the authors propose Prompt Condensation (PC), a gradient-based, global scoring approach that selects the most impactful prompts across all layers and fine-tunes only them, cutting the number of prompts by around 70% while preserving accuracy. The method is validated on the FGVC and VTAB-1k benchmarks with ViT-B/16 and Swin-B backbones, showing substantial latency and FLOPs savings with minimal accuracy loss, which makes parameter-efficient transfer learning (PETL) more practical on resource-constrained devices.
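The global selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `condense_prompts`, the `keep_ratio` parameter, and the toy scores are all hypothetical, and the per-prompt importance scores are assumed to have been computed elsewhere by some gradient-based criterion that this summary does not spell out.

```python
from typing import List


def condense_prompts(scores: List[List[float]], keep_ratio: float) -> List[List[bool]]:
    """Return a per-layer keep-mask selecting the globally top-scoring prompts.

    scores[l][i] is a hypothetical importance score for prompt i in layer l
    (assumed to come from a gradient-based criterion computed elsewhere).
    """
    # Flatten to (score, layer, index) so the ranking is global, not per layer.
    flat = [(s, l, i) for l, layer in enumerate(scores) for i, s in enumerate(layer)]
    n_keep = max(1, round(keep_ratio * len(flat)))
    # Sort by descending score; ties fall back to layer and index for determinism.
    flat.sort(key=lambda t: (-t[0], t[1], t[2]))
    mask = [[False] * len(layer) for layer in scores]
    for _, l, i in flat[:n_keep]:
        mask[l][i] = True
    return mask


# Toy example: 3 layers x 4 prompts with made-up scores.
toy_scores = [
    [0.9, 0.1, 0.4, 0.05],
    [0.2, 0.8, 0.03, 0.02],
    [0.01, 0.3, 0.02, 0.7],
]
toy_mask = condense_prompts(toy_scores, keep_ratio=0.3)
kept_per_layer = [sum(row) for row in toy_mask]
```

Because the ranking is global, layers can end up keeping different numbers of prompts. Only the prompts marked `True` would then be fine-tuned; the rest are dropped.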

Abstract

Due to increasing interest in adapting models on resource-constrained edge devices, parameter-efficient transfer learning has been widely explored. Among various methods, Visual Prompt Tuning (VPT), which prepends learnable prompts to the input space, shows competitive fine-tuning performance compared to training the full network parameters. However, VPT increases the number of input tokens, resulting in additional computational overhead. In this paper, we analyze the impact of the number of prompts on fine-tuning performance and on the self-attention operation in a vision transformer architecture. Through theoretical and empirical analysis, we show that adding more prompts does not lead to linear performance improvement. Further, we propose a Prompt Condensation (PC) technique that aims to prevent performance degradation from using a small number of prompts. We validate our methods on FGVC and VTAB-1k tasks and show that our approach reduces the number of prompts by ~70% while maintaining accuracy.
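To make the token overhead concrete, here is a rough back-of-the-envelope sketch. It counts only the two token-by-token matrix products in one self-attention layer and ignores projections, softmax, and the MLP, so it is not a full FLOPs model. The function names are hypothetical, and the prompt counts (100 and 30) and the 224x224 input size are illustrative assumptions, not figures from the paper.

```python
def attention_matmul_cost(num_tokens: int, dim: int) -> int:
    """Rough multiply-accumulate count for Q K^T and A V in one attention layer."""
    return 2 * num_tokens * num_tokens * dim


def relative_overhead(base_tokens: int, num_prompts: int, dim: int) -> float:
    """Cost with prompts divided by cost without prompts (1.0 means no overhead)."""
    base = attention_matmul_cost(base_tokens, dim)
    with_prompts = attention_matmul_cost(base_tokens + num_prompts, dim)
    return with_prompts / base


# ViT-B/16 on an assumed 224x224 input: 196 patch tokens + 1 class token, width 768.
BASE_TOKENS = 197
DIM = 768

overhead_100 = relative_overhead(BASE_TOKENS, 100, DIM)  # hypothetical 100 prompts
overhead_30 = relative_overhead(BASE_TOKENS, 30, DIM)    # same setup, 70% fewer prompts
```

Under these assumptions the attention products grow with the square of the token count, which is why trimming prompts pays off more than proportionally in this part of the network.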

Paper Structure

This paper contains 12 sections, 2 theorems, 16 equations, 8 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let $A \in \mathbb{R}^{n \times n}$ be a self-attention matrix, and let $v \in \mathbb{R}^{n}$ be a column vector of the value matrix $V$. Then there exists a low-rank matrix $\tilde{A} \in \mathbb{R}^{n \times n}$ satisfying $\Pr\left(\|\tilde{A}v - Av\| < \epsilon \|Av\|\right) > 1 - o(1)$, where the rank of $\tilde{A}$ is bounded, i.e., $\operatorname{rank}(\tilde{A}) = \Theta(\log n)$.
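As a rough illustration of what a logarithmic bound implies when prompts are added, the sketch below compares $\log(n + p)$ with $\log(n)$. The $\Theta(\cdot)$ notation hides constants, so the ratio only indicates the shape of the growth, not an actual rank. The function name and the token counts are illustrative assumptions.

```python
import math


def log_bound_growth(base_tokens: int, num_prompts: int) -> float:
    """Ratio log(n + p) / log(n): growth of a log(n)-shaped bound after adding p prompts."""
    return math.log(base_tokens + num_prompts) / math.log(base_tokens)


# Assumed setting: 197 base tokens (ViT-B/16 with a 224x224 input), then doubled by prompts.
growth_when_doubled = log_bound_growth(197, 197)
```

Doubling the token count from 197 to 394 raises this ratio to only about 1.13, which is consistent with the claim that extra prompts add little effective rank.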

Figures (8)

  • Figure 1: Accuracy depending on the number of prompts used for VPT training. We transfer an ImageNet-22k pre-trained ViT-B/16 (Dosovitskiy et al., 2020) to three downstream tasks. The x-axis shows the number of prompts relative to the original number reported by Jia et al. (2022). The vertical dotted line marks the point where accuracy drops by less than 1% compared with using 100% of the prompts.
  • Figure 2: The normalized cumulative eigenvalues of the self-attention matrix $A$ (defined in the paper's self-attention equation) on Stanford Cars and DMLab. We report the mean and standard deviation across all layers.
  • Figure 3: Accuracy changes by removing whole prompts in one layer. We report the original accuracy with a dotted line. Each dataset shows a different trend in accuracy degradation.
  • Figure 4: The test accuracy of VPT-Deep (Jia et al., 2022), PC w/o fine-tuning, and our proposed PC with respect to the number of prompts. We use the ViT-B/16 backbone. A dotted line represents the accuracy with 100% prompts.
  • Figure 5: The test accuracy of VPT-Shallow (Jia et al., 2022), PC w/o fine-tuning, and our proposed PC with respect to the number of prompts. We use the ViT-B/16 backbone. A dotted line represents the accuracy with 100% prompts.
  • ...and 3 more figures
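The quantity plotted in Figure 2 can be illustrated with a small sketch. It assumes the spectrum (eigenvalues or singular values of an attention matrix) has already been computed elsewhere; the function name and the toy values are hypothetical and are not data from the paper.

```python
from typing import List


def normalized_cumulative(values: List[float]) -> List[float]:
    """Sort magnitudes in descending order and return the running sum divided by the total."""
    mags = sorted((abs(v) for v in values), reverse=True)
    total = sum(mags)
    if total == 0:
        raise ValueError("spectrum must contain at least one non-zero value")
    curve, running = [], 0.0
    for m in mags:
        running += m
        curve.append(running / total)
    return curve


# Made-up spectrum: a few large values dominate, as in an effectively low-rank matrix.
toy_curve = normalized_cumulative([4.0, 2.0, 1.0, 1.0])
```

A curve that approaches 1.0 after only a few components indicates that most of the spectrum's mass sits in a small number of directions, which is the sense in which the attention matrix is described as low-rank.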

Theorems & Definitions (3)

  • Theorem 1: Self-attention is low rank. Proved in Wang et al. (2020), the Linformer paper.
  • Proposition 1
  • proof