Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning

Chun-Mei Feng; Kai Yu; Yong Liu; Salman Khan; Wangmeng Zuo

TL;DR

This paper tackles test-time prompt tuning under domain shifts by augmenting a single test sample with diverse, diffusion-generated images while enforcing semantic fidelity via cosine similarity filtration. By combining standard augmentation with diffusion-based data and filtering spurious examples, DiffTPT significantly improves zero-shot accuracy (average 5.13% over state-of-the-art) without requiring target-domain labels. The approach is validated across natural distribution shifts and cross-dataset generalization tasks, with ablations identifying effective settings for augmentation size, filtration thresholds, and prompt-update steps. The results highlight a practical, model-agnostic path to robust cross-domain performance for vision-language models like CLIP.

Abstract

Thanks to prompt tuning, pre-trained vision-language models such as CLIP have recently shown promising performance on a wide range of downstream tasks. In this paper, we focus on a particular setting: learning adaptive prompts on the fly for each test sample from an unseen new domain, known as test-time prompt tuning (TPT). Existing TPT methods typically rely on data augmentation and confidence selection. However, conventional data augmentation techniques, e.g., random resized crops, suffer from a lack of data diversity, while entropy-based confidence selection alone is not sufficient to guarantee prediction fidelity. To address these issues, we propose a novel TPT method, named DiffTPT, which leverages pre-trained diffusion models to generate diverse and informative new data. Specifically, we combine data augmented by conventional methods with data generated by pre-trained Stable Diffusion to exploit their respective merits, improving the model's ability to adapt to unknown new test data. Moreover, to ensure the prediction fidelity of generated data, we introduce a cosine similarity-based filtration technique that selects the generated data with higher similarity to the single test sample. Our experiments on test datasets with distribution shifts and unseen categories demonstrate that DiffTPT improves zero-shot accuracy by an average of 5.13% compared to the state-of-the-art TPT method. Our code and models will be publicly released.
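The cosine similarity-based filtration described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `keep_ratio` parameter (a stand-in for the filtration threshold), and the toy 2-D feature vectors are assumptions made for illustration. In practice the features would come from an image encoder such as CLIP's.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (assumption)
    return dot / (norm_u * norm_v)


def filter_by_similarity(
    test_feat: Sequence[float],
    aug_feats: Sequence[Sequence[float]],
    keep_ratio: float = 0.5,  # hypothetical knob, not a value from the paper
) -> List[int]:
    """Return indices of the augmented features most similar to the test
    feature, ordered from most to least similar."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not aug_feats:
        return []
    sims = [cosine_similarity(test_feat, f) for f in aug_feats]
    n_keep = max(1, math.ceil(keep_ratio * len(aug_feats)))
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return ranked[:n_keep]


# Toy features: index 0 and 3 point roughly the same way as the test sample,
# index 1 is orthogonal, index 2 points the opposite way.
test_feature = [1.0, 0.0]
generated = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
kept = filter_by_similarity(test_feature, generated, keep_ratio=0.5)
print(kept)  # [0, 3]
```

Ranking and keeping a fixed fraction is one simple way to realize a threshold; an absolute similarity cutoff would be an equally plausible reading of the abstract.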

Paper Structure (15 sections, 7 equations, 11 figures, 1 table)

Introduction
Related Work
Methodology
Test-time Prompt Tuning
Approach Overview
Diffusion-based Diverse Data Augmentation
Filtration with Cosine Similarity
Experiments
Experimental Setup
Comparison with State-of-the-arts
Ablation Studies
Conclusion
Comparisons of the Full Dataset on $\mathcal{S}_1$ and $\mathcal{S}_2$
Proportion Analysis of the Different Augmented Images
Visualization of the Generated and Filtered Image

Figures (11)

Figure 1: (a) The prior TPT method \cite{shu2022test} uses different augmented views along with confidence selection, which yields overly simple variants of the test data and discards unconfident yet correct predictions. In comparison, (b) our DiffTPT generates data with richer visual appearance variation and selects generated data with higher prediction fidelity.
Figure 2: Overview of our proposed DiffTPT. We first (a) use pre-trained Stable Diffusion to generate data with richer visual appearance variation, then (b) apply cosine similarity-based filtration against the single test sample to remove spurious augmentations, so that our method trades off diversity against fidelity.
Figure 3: Visualization of the diverse and informative diffusion-based augmented images and the filtered image by cosine similarity.
Figure 4: Top-1 accuracy (%) of state-of-the-art baselines under $\mathcal{S}_2$, where Avg. indicates the average accuracy on Cross-Dataset Generalization. The arrows ${\color{ForestGreen}\uparrow}$ and ${\color{red}\downarrow}$ indicate improvements and decreases of our method relative to the CLIP baselines, i.e., CLIP-RN50 and CLIP-ViT-B/16. Detailed analyses are provided in Sec. \ref{sec:acc}.
Figure 5: Top-1 accuracy as a function of the proportion of standard augmented views to diffusion-based augmented images under (a) $\mathcal{S}_1$ and (b) $\mathcal{S}_2$.
...and 6 more figures
