Do We Really Need a Large Number of Visual Prompts?
Youngeun Kim, Yuhang Li, Abhishek Moitra, Ruokai Yin, Priyadarshini Panda
TL;DR
This paper investigates how the number of visual prompt tokens in Visual Prompt Tuning (VPT) affects fine-tuning accuracy and the self-attention operation in Vision Transformers. It shows that adding prompts yields non-linear accuracy gains and that self-attention remains effectively low-rank, with the rank growing only logarithmically as prompts are added. To counter the accuracy drop that comes with using fewer prompts, the authors propose Prompt Condensation (PC), a gradient-based, global scoring approach that selects the most impactful prompts across all layers and fine-tunes only those, cutting the number of prompts by roughly 70% while preserving accuracy. The method is validated on the FGVC and VTAB-1k benchmarks with ViT-B/16 and Swin-B backbones, where it delivers substantial latency and FLOPs savings with minimal accuracy loss, making parameter-efficient transfer learning (PETL) more practical on resource-constrained devices.
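The global selection step described above can be illustrated with a minimal sketch. The scoring rule used here (the sum of |gradient × value| over the embedding dimensions of each prompt) is an assumption chosen for illustration and may differ from the paper's exact criterion. The function names, the toy prompt and gradient values, and the keep ratio are all hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def prompt_score(prompt: Vector, grad: Vector) -> float:
    """Assumed importance score: sum over dimensions of |grad * value|."""
    return sum(abs(g * p) for g, p in zip(grad, prompt))


def condense_prompts(
    prompts: Dict[int, List[Vector]],
    grads: Dict[int, List[Vector]],
    keep_ratio: float,
) -> List[Tuple[int, int]]:
    """Rank every prompt in every layer together and keep the top fraction.

    Returns the (layer, index) pairs of the prompts to keep training.
    """
    scored = []
    for layer, layer_prompts in prompts.items():
        for idx, prompt in enumerate(layer_prompts):
            score = prompt_score(prompt, grads[layer][idx])
            scored.append((score, layer, idx))
    # Global ranking: layers compete with each other for the prompt budget.
    scored.sort(key=lambda item: item[0], reverse=True)
    n_keep = max(1, int(keep_ratio * len(scored)))
    return sorted((layer, idx) for _, layer, idx in scored[:n_keep])


# Toy example: two layers with two 2-dimensional prompts each (made-up values).
prompts = {0: [[1.0, 1.0], [0.1, 0.1]], 1: [[2.0, 0.0], [0.5, 0.5]]}
grads = {0: [[0.1, 0.1], [0.1, 0.1]], 1: [[1.0, 1.0], [0.1, 0.1]]}

kept = condense_prompts(prompts, grads, keep_ratio=0.5)
print(kept)  # the two highest-scoring prompts: (0, 0) and (1, 0)
```

Because the ranking is global rather than per layer, one layer may retain more prompts than another. In a full pipeline, only the retained prompts would then be fine-tuned further.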
Abstract
Due to increasing interest in adapting models on resource-constrained edge devices, parameter-efficient transfer learning has been widely explored. Among various methods, Visual Prompt Tuning (VPT), which prepends learnable prompts to the input space, shows competitive fine-tuning performance compared to training the full network parameters. However, VPT increases the number of input tokens, resulting in additional computational overhead. In this paper, we analyze the impact of the number of prompts on fine-tuning performance and on the self-attention operation in a vision transformer architecture. Through theoretical and empirical analysis, we show that adding more prompts does not lead to a linear performance improvement. Further, we propose a Prompt Condensation (PC) technique that aims to prevent the performance degradation caused by using a small number of prompts. We validate our methods on FGVC and VTAB-1k tasks and show that our approach reduces the number of prompts by ~70% while maintaining accuracy.
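To make the overhead from extra input tokens concrete, the following back-of-the-envelope sketch estimates the multiply-accumulate (MAC) count of one transformer block as a function of token count. It uses the common approximation of four d×d projections, two n×n attention products, and a two-layer MLP, and ignores normalization, softmax, and biases. The defaults (embedding width 768, and 197 base tokens for a 224×224 input with 16×16 patches plus a class token) correspond to a standard ViT-B/16 setup, which is assumed here. The prompt counts in the example are hypothetical and are not results from the paper.

```python
def layer_macs(num_tokens: int, dim: int = 768, mlp_ratio: int = 4) -> int:
    """Approximate MACs of one transformer block for a given token count."""
    proj = 4 * num_tokens * dim * dim             # Q, K, V and output projections
    attn = 2 * num_tokens * num_tokens * dim      # QK^T and attention-weighted V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim  # two MLP linear layers
    return proj + attn + mlp


def relative_cost(num_prompts: int, base_tokens: int = 197) -> float:
    """Per-block cost with prompts prepended, relative to no prompts."""
    return layer_macs(base_tokens + num_prompts) / layer_macs(base_tokens)


# Hypothetical prompt budgets: a large one and one reduced by 70%.
for n in (0, 60, 200):
    print(f"{n:>3} prompts -> {relative_cost(n):.2f}x per-block MACs")
```

The attention term grows quadratically in the token count, so the relative cost rises slightly faster than the token count itself. Removing prompts therefore saves proportionally more compute than the drop in token count alone would suggest.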
