Raw Data Matters: Enhancing Prompt Tuning by Internal Augmentation on Vision-Language Models

Haoyang Li, Liang Wang, Chao Wang, Siyu Zhou, Jing Jiang, Yan Peng, Guodong Long

TL;DR

AugPT tackles data scarcity in CLIP-based prompt tuning by introducing Adaptive Self-supervised Augmentation (ASA), a Consensus-based Filtering Gate (CFG) that uses a frozen, high-capacity teacher to filter augmented views, and Optimized Prompt Distillation (OPD) that distills knowledge from a large teacher to a ViT-B/16 student via $\mathrm{KL}$ divergence. The method relies solely on internal augmentation of the existing unlabeled data, avoiding external knowledge or additional data collection. Empirical results across 11 datasets demonstrate consistent improvements in base-class accuracy and new-class generalization, with strong cross-dataset transfer and competitive few-shot performance. Overall, AugPT offers a practical, data-efficient pathway to adapt vision-language models for diverse downstream tasks with favorable inference speed and without external knowledge bottlenecks.
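The distillation step mentioned above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the `temperature` parameter, and the direction of the divergence (teacher distribution as the target) are assumptions made for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over class probabilities for one image.

    Assumption: the teacher distribution is the fixed target and the
    student is pulled toward it; `temperature` is a hypothetical knob.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for a 3-class problem.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = kl_distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's logits exactly and grows as the two distributions diverge, which is the property a logit-matching objective relies on.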

Abstract

For CLIP-based prompt tuning, introducing more data as additional knowledge to enhance the fine-tuning process has proven to be an effective approach. Existing data amplification strategies for prompt tuning typically rely on external knowledge (e.g., large language models or pre-structured knowledge bases), resulting in higher costs for data collection and processing, while generally ignoring further utilization of features in the image modality. To address this, we propose Augmentation-driven Prompt Tuning (AugPT), a self-contained distillation-based prompt tuning approach that uses only internal augmentation on the raw dataset to better exploit known features. Specifically, AugPT employs self-supervised augmentation on unlabeled images in the training set and introduces a novel gating mechanism based on a consensus test, which reuses the pre-trained prompt tuning backbone model to spontaneously filter noisy samples, further enhancing the quality of augmented views. Extensive experiments validate that AugPT simultaneously enhances model performance and generalization capability without using appended external knowledge. The code of AugPT is available at: https://github.com/JREion/AugPT .
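As a rough illustration of the consensus-based gating idea, the sketch below keeps an augmented view only when a teacher's Top-1 prediction on that view agrees with its Top-1 prediction on the raw image. This criterion is an assumption inferred from the "Top-1 Consensus-based Filtering Gate" name in the Figure 2 caption; the function names and the logits are hypothetical, and the real method operates on outputs of the pre-tuned teacher model rather than on hand-written lists.

```python
def top1(logits):
    """Index of the highest-scoring class."""
    return max(range(len(logits)), key=lambda i: logits[i])


def consensus_filter(raw_logits, view_logits):
    """Return indices of augmented views that pass the gate.

    Assumption: a view passes when the teacher's Top-1 class on the view
    equals its Top-1 class on the raw image; other views are treated as noisy.
    """
    anchor = top1(raw_logits)
    return [i for i, logits in enumerate(view_logits) if top1(logits) == anchor]


# Hypothetical teacher logits for one raw image and three augmented views.
raw = [0.1, 2.0, 0.3]
views = [
    [0.0, 1.5, 0.2],  # agrees with the raw image (class 1)
    [3.0, 0.1, 0.2],  # disagrees (class 0), e.g. a crop that lost the object
    [0.2, 0.9, 0.8],  # agrees (class 1)
]
kept = consensus_filter(raw, views)
```

Because the gate only compares predictions from a frozen model, it needs no ground-truth labels, which is consistent with the abstract's description of filtering unlabeled images.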

Paper Structure

This paper contains 45 sections, 21 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: (a) Impact of the amount of data (shots) per class on performance in prompt tuning models [khattak2023maple, yao2023kgcoop, khattak2023promptsrc]; (b) Visualization of bad cases in existing self-supervised augmentation approaches. In comparison, AugPT obtains quality improvement as in (c) by filtering noisy samples, surpassing the current state-of-the-art PromptKD [li2024promptkd] as backbone model in (d) base-to-new tasks in both full and few-shot scenarios over 11 datasets.
  • Figure 2: Framework of AugPT. In (a) fine-tuning stage, AugPT applies Adaptive Augmentation to raw unlabeled images $I'$, then reuses the teacher model from distillation-based backbone to discard noisy samples in the augmented image set $\mathcal{D}(I')$ via Top-1 Consensus-based Filtering Gate, followed by fine-tuning through Optimized Prompt Distillation by fitting the logits of augmented data between teacher and student. Tuned student image prompts $\boldsymbol{P}_{v}^{\text{stu}}$ and projection layer $\text{Proj}(\cdot)$ are applied in (b) inference stage to align student visual features with teacher text embeddings $g^{\text{tea}}({\boldsymbol{P}_{t}^{\text{tea}}})$.
  • Figure 3: Relative to the backbone model, the HM performance improvement under different (a) augmented image number $N$, (b) step size $S$ and (c) filtering gate operator.
  • Figure 4: The framework of (a) the teacher model used in distillation-based prompt tuning approaches. In (b) AugPT, the pre-tuned teacher image prompts $\boldsymbol{P}_{v}^{\text{tea}}$ and text prompts $\boldsymbol{P}_{t}^{\text{tea}}$ are utilized for online inference in Consensus-based Filtering Gate (Sec. 3.3 in main text). Moreover, textual features obtained by ViT-L/14-based teacher text modality are used in Optimized Prompt Distillation (Sec. 3.4 in main text) for interacting with student visual features.
  • Figure 5: Pseudo-code of AugPT in PyTorch.
  • ...and 1 more figures