LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning

Shentong Mo, Yansen Wang, Xufang Luo, Dongsheng Li

TL;DR

This paper addresses forgetting in visual prompt tuning for self-supervised Vision Transformers by introducing Long-term Spatial Prompt Tuning (LSPT), which combines a global spatial prompt coding module with a Long-term Prompt Coding (LPC) mechanism based on a shared LSTM. The approach preserves information from earlier blocks while accumulating spatial cues from patch tokens, enabling more robust transfer to fine-grained classification and semantic segmentation tasks. Extensive experiments on 5 FGVC datasets and 19 VTAB-1K tasks, across MAE and MoCo v3 pretraining, show that LSPT consistently outperforms state-of-the-art baselines, with ablations confirming the additive benefits of its components and visualizations illustrating improved category-aware attention. The work demonstrates a principled, efficient method to mitigate both temporal and spatial forgetting in visual prompting, with potential implications for broader cross-modal prompting research.

Abstract

Visual Prompt Tuning (VPT) techniques have gained prominence for their capacity to adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed as prompts. Contemporary VPT methodologies, especially when employed with self-supervised vision transformers, often default to the introduction of new learnable prompts or gated prompt tokens predominantly sourced from the model's previous block. A pivotal oversight in such approaches is their failure to harness the potential of long-range previous blocks as sources of prompts within each self-supervised ViT. To bridge this crucial gap, we introduce Long-term Spatial Prompt Tuning (LSPT) - a revolutionary approach to visual representation learning. Drawing inspiration from the intricacies of the human brain, LSPT ingeniously incorporates long-term gated prompts. This feature serves as temporal coding, curbing the risk of forgetting parameters acquired from earlier blocks. Further enhancing its prowess, LSPT brings into play patch tokens, serving as spatial coding. This is strategically designed to perpetually amass class-conscious features, thereby fortifying the model's prowess in distinguishing and identifying visual categories. To validate the efficacy of our proposed method, we engaged in rigorous experimentation across 5 FGVC and 19 VTAB-1K benchmarks. Our empirical findings underscore the superiority of LSPT, showcasing its ability to set new benchmarks in visual prompt tuning performance.

Paper Structure (35 sections, 4 equations, 5 figures, 10 tables, 1 algorithm)

Figures (5)

  • Figure 1: Comparison of the forgetting problem in GaPT and the shape-information awareness of our LSPT. At the 12th block, the attention map of the state-of-the-art approach has become blurred and has almost lost the crucial spatial information. In contrast, LSPT produces a clear attention map for the object in the raw image, demonstrating its ability to incorporate spatial information and pass it through long-range blocks.
  • Figure 2: Illustration of the proposed Long-term Spatial Prompt Tuning (LSPT) framework. For transformer block $l$, the Global Spatial Prompt Coding (GSPC) module adds the average embedding of the block's patch tokens $\mathbf{X}^{l}\in\mathbb{R}^{N\times D}$ to the output prompts $\widehat{\mathbf{X}}_P^{l}\in\mathbb{R}^{N_p\times D}$ to generate global spatial prompts $\widehat{\mathbf{X}}_{SP}^{l}\in\mathbb{R}^{N_p\times D}$. The Long-term Prompt Coding (LPC) module with parallel importance then takes the inserted prompt tokens $\mathbf{P}^{l}\in\mathbb{R}^{N_p\times D}$ as input and $\widehat{\mathbf{X}}_{SP}^{l}$ as hidden states, while the output context embeddings $\mathbf{C}^{l-1}\in\mathbb{R}^{N_p\times D}$ from block $l-1$ are fed into the layer as cell states. Finally, the updated output prompts $\mathbf{X}_P^{l}$ are used as the new prompt tokens for block $l+1$ to achieve long-term prompt coding.
  • Figure 3: Qualitative visualization of long-term prompt forgetting in a state-of-the-art visual prompt tuning method (Yoo et al., 2023). From left to right: layer 1 to layer 12.
  • Figure 4: Qualitative visualization of spatial attention forgetting in a state-of-the-art visual prompt tuning method (Yoo et al., 2023). From left to right: layer 1 to layer 12.
  • Figure 3.1: Qualitative visualization of the category-aware attention maps learned by the proposed LSPT.
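
The data flow described in the Figure 2 caption can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `gspc` and `lpc`, the standard LSTM-cell gate equations with gate order (input, forget, cell, output), the treatment of each of the $N_p$ prompts as an independent batch element, and the random stand-ins for the frozen transformer block outputs are all hypothetical. The "parallel importance" component mentioned in the caption is not modeled here.

```python
import math
import random

Matrix = list[list[float]]  # rows are tokens, columns are embedding dimensions


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gspc(patch_tokens: Matrix, output_prompts: Matrix) -> Matrix:
    """GSPC sketch: add the mean patch embedding (over N tokens) to every output prompt."""
    n = len(patch_tokens)
    mean = [sum(col) / n for col in zip(*patch_tokens)]
    return [[p + m for p, m in zip(row, mean)] for row in output_prompts]


def lpc(inserted: Matrix, spatial: Matrix, context: Matrix,
        w_x: Matrix, w_h: Matrix, bias: list[float]) -> tuple[Matrix, Matrix]:
    """LPC sketch: one LSTM-cell step (assumed form).

    inserted -> LSTM input (P^l), spatial -> hidden state, context -> cell state (C^{l-1}).
    Returns the updated prompts (new hidden state) and the new context (new cell state).
    """
    d = len(inserted[0])
    gx, gh = matmul(inserted, w_x), matmul(spatial, w_h)  # each N_p x 4D
    new_prompts, new_context = [], []
    for row_x, row_h, row_c in zip(gx, gh, context):
        z = [a + b + c for a, b, c in zip(row_x, row_h, bias)]
        i, f, g, o = (z[k * d:(k + 1) * d] for k in range(4))
        c_new = [sigmoid(fj) * cj + sigmoid(ij) * math.tanh(gj)
                 for fj, cj, ij, gj in zip(f, row_c, i, g)]
        h_new = [sigmoid(oj) * math.tanh(cj) for oj, cj in zip(o, c_new)]
        new_context.append(c_new)
        new_prompts.append(h_new)
    return new_prompts, new_context


def rand_matrix(rows: int, cols: int, rng: random.Random, scale: float = 0.1) -> Matrix:
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical): N patch tokens, N_P prompts, embedding dim D.
rng = random.Random(0)
N, N_P, D, NUM_BLOCKS = 6, 2, 4, 3

# One set of LSTM weights reused at every block (the "shared LSTM" of the TL;DR).
w_x, w_h = rand_matrix(D, 4 * D, rng), rand_matrix(D, 4 * D, rng)
bias = [0.0] * (4 * D)

prompts = rand_matrix(N_P, D, rng)             # inserted prompts for the first block
context = [[0.0] * D for _ in range(N_P)]      # initial cell state

for _ in range(NUM_BLOCKS):
    # Stand-ins for the frozen block's outputs; a real model would compute these.
    patch_tokens = rand_matrix(N, D, rng, scale=1.0)
    output_prompts = rand_matrix(N_P, D, rng, scale=1.0)
    spatial = gspc(patch_tokens, output_prompts)
    # The updated prompts become the inserted prompts of the next block.
    prompts, context = lpc(prompts, spatial, context, w_x, w_h, bias)
```

Because the cell state `context` is carried from block to block, information from early blocks can persist in later prompts, which is the intuition behind the long-term (temporal) coding; the mean patch embedding added in `gspc` is what injects the spatial signal.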