Visual Variational Autoencoder Prompt Tuning

Xi Xiao; Yunbei Zhang; Yanshuh Li; Xingjian Li; Tianyang Wang; Jihun Hamm; Xiao Wang; Min Xu

TL;DR

The paper tackles the limitation of static Visual Prompt Tuning prompts by introducing V$^2$APT, a framework that generates instance-specific prompts via a Variational Autoencoder conditioned on input features. By encoding image embeddings into a latent distribution and decoding latent samples into prompts, the method produces dynamic prompts that are combined with domain prompts without increasing total prompt tokens. A KL divergence term regularizes the latent space to a standard Gaussian, improving generalization. Empirical results on FGVC, HTA, and VTAB-1k show state-of-the-art performance among PEFT methods, with notable gains on HTA and VTAB-1k and robustness across ViT and Swin backbones, highlighting the potential of latent-model-driven prompt synthesis for vision transformers.
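For reference, the KL regularizer mentioned above has a well-known closed form when the encoder outputs a diagonal Gaussian. The notation below ($q_\phi$, $\mu_j$, $\sigma_j$, latent dimension $d$) is assumed for illustration and is not taken from the paper; the weighting of this term in the full training objective is likewise not specified here.

$$
D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,\mathcal{N}(0, I)\right)
= \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right),
\qquad q_\phi(z \mid x) = \mathcal{N}\!\left(\mu, \operatorname{diag}(\sigma^2)\right).
$$

The term is zero exactly when $\mu = 0$ and $\sigma = 1$, so it pulls the per-image latent distribution toward the standard Gaussian prior.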

Abstract

Parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for adapting large vision transformers to downstream tasks without the prohibitive computational cost of full fine-tuning. While existing visual prompt tuning (VPT) methods have made significant strides, they predominantly rely on static, domain-specific prompts that fail to capture the rich visual diversity within individual instances. This paper introduces V$^2$APT (Visual Variational Autoencoder Prompt Tuning), a novel framework that generates dynamic, input-dependent prompts using a variational autoencoder architecture. By learning a latent representation of image-specific features and decoding it into customized prompts, V$^2$APT adapts to the unique visual characteristics of each input. Extensive experiments on the FGVC, HTA, and VTAB-1k benchmarks demonstrate that our approach consistently outperforms state-of-the-art PEFT methods. Notably, V$^2$APT achieves a +3.2\% improvement over VPT-Deep on HTA, with an average performance gain of +2.0\% across all three benchmarks.
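The encode, sample, decode pipeline described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the single linear encoder and decoder layers, the pooled `feature` vector, the helper names (`init_linear`, `generate_prompts`), and the choice to overwrite the last `n_dynamic` static prompt slots so that the token count stays fixed are all assumptions made for illustration.

```python
import math
import random


def init_linear(rng, n_out, n_in, scale=0.1):
    """Hypothetical helper: random weight matrix (list of rows) and zero bias."""
    weight = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def linear(x, weight, bias):
    """Compute y = W x + b for a single vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def generate_prompts(feature, params, static_prompts, n_dynamic, rng):
    """Encode a pooled image feature, sample a latent, decode dynamic prompts."""
    assert 0 <= n_dynamic <= len(static_prompts)
    mu = linear(feature, *params["enc_mu"])
    log_var = linear(feature, *params["enc_log_var"])
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    flat = linear(z, *params["dec"])
    dim = len(feature)
    dynamic = [flat[i * dim:(i + 1) * dim] for i in range(n_dynamic)]
    # Assumption: dynamic prompts take over the last n_dynamic slots, so the
    # total number of prompt tokens is unchanged.
    kept = static_prompts[:len(static_prompts) - n_dynamic]
    return kept + dynamic, kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
dim, latent, n_prompts, n_dynamic = 8, 4, 10, 5
params = {
    "enc_mu": init_linear(rng, latent, dim),
    "enc_log_var": init_linear(rng, latent, dim),
    "dec": init_linear(rng, n_dynamic * dim, latent),
}
static_prompts = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_prompts)]
feature = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in for a pooled image embedding

prompts, kl = generate_prompts(feature, params, static_prompts, n_dynamic, rng)
print(len(prompts), len(prompts[0]), kl)
```

In a training loop, the returned KL value would be added to the task loss with some weight, and the encoder, decoder, and static prompts would be the only trainable parameters while the backbone stays frozen; the specific weight and architecture used in the paper are not given in this section.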
