Visual Variational Autoencoder Prompt Tuning

Xi Xiao, Yunbei Zhang, Yanshuh Li, Xingjian Li, Tianyang Wang, Jihun Hamm, Xiao Wang, Min Xu

TL;DR

The paper addresses a limitation of Visual Prompt Tuning (VPT): its prompts are static and shared across all images. It introduces V$^2$APT, a framework that generates instance-specific prompts with a Variational Autoencoder (VAE) conditioned on input features. The VAE encodes image embeddings into a latent distribution and decodes latent samples into prompts; these dynamic prompts are combined with domain prompts without increasing the total number of prompt tokens. A KL divergence term regularizes the latent space toward a standard Gaussian, which improves generalization. Empirical results on FGVC, HTA, and VTAB-1k show state-of-the-art performance among parameter-efficient fine-tuning (PEFT) methods, with notable gains on HTA and VTAB-1k and robustness across ViT and Swin backbones, highlighting the potential of latent-model-driven prompt synthesis for vision transformers.
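
For reference, a KL term of this kind has a standard closed form when the encoder outputs a diagonal Gaussian. The symbols used here ($\mu_j$, $\sigma_j^2$, and the latent dimension $d$) are notation assumed for illustration and may differ from the paper's own equation; the weight given to this term in the training loss is not stated in this summary.

$$
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)
$$

The term is zero exactly when $\mu = 0$ and $\sigma^2 = 1$ in every dimension, so minimizing it pulls each per-image latent distribution toward the standard Gaussian.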

Abstract

Parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for adapting large vision transformers to downstream tasks without the prohibitive computational costs of full fine-tuning. While existing visual prompt tuning (VPT) methods have made significant strides, they predominantly rely on static, domain-specific prompts that fail to capture the rich visual diversity within individual instances. This paper introduces V$^2$APT (Visual Variational Autoencoder Prompt Tuning), a novel framework that generates dynamic, input-dependent prompts using a variational autoencoder architecture. By learning a latent representation of image-specific features and decoding it into customized prompts, V$^2$APT adapts to the unique visual characteristics of each input. Extensive experiments on the FGVC, HTA, and VTAB-1k benchmarks demonstrate that our approach consistently outperforms state-of-the-art PEFT methods. Notably, V$^2$APT achieves a +3.2% improvement over VPT-Deep on HTA, with an average performance gain of +2.0% across all three benchmarks.

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Comparison of standard Visual Prompt Tuning (VPT) and our proposed V$^2$APT framework. (Left) VPT employs static Domain-Specific Prompts, which remain fixed across different images. (Middle) Our V$^2$APT framework introduces Instance-dependent Prompts, dynamically generated via a Variational Autoencoder (VAE) based on each input image. (Right) The VAE encodes image embeddings into a KL-divergence-regularized latent space, then decodes them into instance prompts that combine with domain-specific prompts before entering the transformer, maintaining the same token count as VPT.
  • Figure 2: Comparison of fine-tuning methods on VTAB-1k across different models. The number within parentheses indicates the percentage of trainable parameters for each approach. Our method consistently outperforms existing techniques on both ViT and Swin transformer architectures.
  • Figure 3: Visualization of prompt maps with and without VAE. The first column shows input images, while the second and third columns illustrate the learned prompt maps without and with VAE integration, respectively.
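
The pipeline described in the Figure 1 caption (encode image embeddings, sample a latent, decode instance prompts, combine with domain prompts at a fixed token count) can be sketched in a few lines. This is a minimal, dependency-free illustration, not the authors' implementation. The following are all assumptions made for the sketch: mean-pooling of patch embeddings, single linear layers for the encoder and decoder, concatenation as the way prompts are combined, the split of the prompt budget into 6 domain and 4 instance tokens, and every name (`linear`, `instance_prompts`, `build_input`, `w_mu`, `w_logvar`, `w_dec`).

```python
import math
import random


def linear(x, weight):
    """Multiply a vector x (length d_in) by a d_in x d_out weight matrix."""
    return [sum(xi * w for xi, w in zip(x, col)) for col in zip(*weight)]


def init(rng, d_in, d_out, scale=0.02):
    """Small random d_in x d_out weight matrix (hypothetical initialization)."""
    return [[rng.gauss(0.0, scale) for _ in range(d_out)] for _ in range(d_in)]


def instance_prompts(patch_embeddings, params, n_instance, rng):
    """Encode pooled patch embeddings, sample z, decode n_instance prompt tokens."""
    dim = len(patch_embeddings[0])
    # Assumed pooling: average the patch embeddings into one image-level feature.
    pooled = [sum(col) / len(patch_embeddings) for col in zip(*patch_embeddings)]
    # Encoder heads produce the mean and log-variance of a diagonal Gaussian.
    mu = linear(pooled, params["w_mu"])
    logvar = linear(pooled, params["w_logvar"])
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]
    # Decoder maps z to n_instance * dim values, reshaped into prompt tokens.
    flat = linear(z, params["w_dec"])
    prompts = [flat[i * dim:(i + 1) * dim] for i in range(n_instance)]
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I).
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))
    return prompts, kl


def build_input(cls_token, domain_prompts, inst_prompts, patch_embeddings):
    """Assumed combination: [CLS] + domain prompts + instance prompts + patches."""
    return [cls_token] + domain_prompts + inst_prompts + patch_embeddings


rng = random.Random(0)
dim, latent, n_patches = 8, 4, 16
n_total, n_instance = 10, 4  # total prompt budget stays at n_total tokens

params = {
    "w_mu": init(rng, dim, latent),
    "w_logvar": init(rng, dim, latent),
    "w_dec": init(rng, latent, n_instance * dim),
}
patches = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_patches)]
# Static domain prompts fill the part of the budget not used by instance prompts.
domain = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_total - n_instance)]
cls_token = [0.0] * dim

inst, kl = instance_prompts(patches, params, n_instance, rng)
tokens = build_input(cls_token, domain, inst, patches)
```

Because the instance prompts take slots out of the existing budget instead of adding to it, the sequence length here is `1 + n_total + n_patches`, the same as a VPT input with `n_total` static prompts. In training, `kl` would be added to the task loss with some weight, and the domain prompts and VAE weights would be the learned parameters.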