iVPT: Improving Task-relevant Information Sharing in Visual Prompt Tuning by Cross-layer Dynamic Connection

Nan Zhou, Jiaxin Chen, Di Huang

TL;DR

This paper tackles the challenge of efficiently adapting vision transformers via visual prompt tuning by addressing the limited inter-layer sharing and vulnerability to input noise in existing prompts. It introduces iVPT, which combines cross-layer dynamic connection (CDC), dynamic aggregation (DA), and attentive reinforcement (AR) to enable task-relevant information sharing across prompt tokens and to reinforce salient image regions during attention. The approach is supported by theoretical insights and extensive experiments across 24 vision tasks, including VTAB-1k, FGVC, and ADE20k, showing state-of-the-art performance with minimal parameter overhead and strong generalizability across backbones and pre-training strategies. Overall, iVPT offers a flexible, robust, and scalable solution for prompting-based adaptation of vision transformers with practical impact for diverse vision tasks.

Abstract

Recent progress has shown the great potential of visual prompt tuning (VPT) for adapting pre-trained vision transformers to various downstream tasks. However, most existing solutions optimize prompts independently at each layer, thereby neglecting the task-relevant information encoded in prompt tokens across layers. Additionally, existing prompt structures are prone to interference from task-irrelevant noise in input images, which can harm the sharing of task-relevant information. In this paper, we propose a novel VPT approach, iVPT. It incorporates a cross-layer dynamic connection (CDC) for input prompt tokens from adjacent layers, enabling effective sharing of task-relevant information. Furthermore, we design a dynamic aggregation (DA) module that facilitates selective sharing of information between layers. The combination of CDC and DA enhances the flexibility of the attention process within the VPT framework. Building upon these foundations, iVPT introduces an attentive reinforcement (AR) mechanism, which automatically identifies salient image tokens and then enhances them with prompt tokens in an additive manner. Extensive experiments on 24 image classification and semantic segmentation benchmarks clearly demonstrate the advantage of the proposed iVPT over state-of-the-art counterparts.
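
The abstract states only that CDC connects input prompt tokens of adjacent layers and that DA makes this sharing selective; it does not give the formula. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `connect_prompts`, the per-token sigmoid gate, and the convex mixing rule are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def connect_prompts(prev_prompts, layer_prompts, gate_logits):
    """Mix previous-layer input prompts into the current layer's prompts.

    prev_prompts, layer_prompts: lists of N token vectors (lists of floats).
    gate_logits: N scalars, one per prompt token (assumed to be learnable;
    this gating form is an illustrative assumption, not taken from the paper).
    """
    mixed = []
    for prev, cur, logit in zip(prev_prompts, layer_prompts, gate_logits):
        g = sigmoid(logit)  # share of the previous-layer token that is kept
        mixed.append([g * p + (1.0 - g) * c for p, c in zip(prev, cur)])
    return mixed


# Two prompt tokens of dimension 2; a logit of 0 gives an equal mix.
prev = [[1.0, 1.0], [2.0, 2.0]]
cur = [[0.0, 0.0], [0.0, 4.0]]
mixed = connect_prompts(prev, cur, gate_logits=[0.0, 0.0])
```

With a large positive logit the layer reuses the previous layer's prompt almost unchanged, and with a large negative logit it falls back to its own prompt, which is one simple way to read "selective sharing".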

Paper Structure (21 sections, 10 equations, 7 figures, 18 tables)

Figures (7)

  • Figure 1: Comparison of distinct VPT approaches. (a) VPT-deep truncates output prompt tokens and independently learns prompt tokens across layers. (b) Existing attempts preserve output prompt tokens and learn prompts on them. In contrast, (c) iVPT builds cross-layer dynamic connection (CDC) with dynamic aggregation (DA) on input prompt tokens, benefiting task-relevant information sharing, and introduces the attentive reinforcement (AR) module to highlight salient image regions.
  • Figure 2: Illustration of the proposed iVPT approach. 1) Cross-layer dynamic connection (CDC): dynamically aggregates (via DA) and connects the input prompt tokens of adjacent layers. 2) Attentive reinforcement (AR): first identifies salient image tokens according to their attention weights with the cls token, then enhances these salient tokens with learnable prompt tokens.
  • Figure 3: Illustration of different prompt structures. (A) VPT-deep [jia2022vpt]. (B) ProVP [progressive], which preserves the output prompt tokens and adds learnable prompt tokens. (C) EXPRESS [express], which preserves the output prompt tokens and adds prompt tokens before the LN, QKV projection, and MSA layers. (D) Vanilla CDC, which transfers the input prompt tokens from all preceding layers to the current layer. (E) CDC, which transfers the input prompt tokens solely from the previous layer to the current layer.
  • Figure 4: Analysis on the robustness of performance against distinct rates of Gaussian noise in the input image, by comparing our CDC method and EXPRESS.
  • Figure 5: Ablation results on the number of prompt tokens ($N$) in CDC and the number of prompt tokens ($k$) in AR.
  • ...and 2 more figures
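
The attentive reinforcement step described in the Figure 2 caption (rank image tokens by their attention weight with the cls token, then add learnable prompt tokens to the salient ones) can be sketched as follows. This is an illustrative sketch only: the function name `reinforce_salient_tokens`, the top-k selection rule, and the pairing of prompts with tokens by rank are assumptions, since the summary does not specify them.

```python
def reinforce_salient_tokens(image_tokens, cls_attention, reinforce_prompts):
    """Add prompt vectors to the k image tokens the cls token attends to most.

    image_tokens: list of token vectors (lists of floats).
    cls_attention: one attention weight per image token.
    reinforce_prompts: k prompt vectors (assumed learnable); k is the number
    of AR prompt tokens. Pairing prompt j with the j-th most salient token
    is a hypothetical choice made for this example.
    """
    k = len(reinforce_prompts)
    # Indices of image tokens, most attended first.
    ranked = sorted(range(len(image_tokens)),
                    key=lambda i: cls_attention[i], reverse=True)
    salient = ranked[:k]
    out = [list(tok) for tok in image_tokens]  # copy; leave the input intact
    for prompt, idx in zip(reinforce_prompts, salient):
        out[idx] = [t + p for t, p in zip(out[idx], prompt)]
    return out, salient


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention = [0.1, 0.6, 0.3]
reinforced, salient = reinforce_salient_tokens(tokens, attention, [[0.5, 0.5]])
```

Here only the second token is reinforced, because it has the largest attention weight and a single prompt token is supplied; the other tokens pass through unchanged.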