Efficient Test-Time Prompt Tuning for Vision-Language Models

Yuhan Zhu; Guozhen Zhang; Chen Xu; Haocheng Shen; Xiaoxin Chen; Gangshan Wu; Limin Wang

TL;DR

Test-time prompt tuning improves zero-shot classification with vision-language models, but adapting prompts for every image makes inference expensive. The paper introduces Self-TPT, a framework that uses text-oriented self-supervised learning (SSL) to adapt prompts to new classes at test time without per-image backpropagation, via Contrastive Prompt Tuning (CPT) and a Gradient Matching (GM) loss. Stage 1 co-trains prompts with SSL on source data; Stage 2 adapts prompts to the target class set using SSL alone; Stage 3 predicts with the prompts held fixed. Across three zero-shot benchmarks, Self-TPT achieves state-of-the-art accuracy while significantly reducing inference cost.

Abstract

Vision-language models have showcased impressive zero-shot classification capabilities when equipped with suitable text prompts. Previous studies have shown the effectiveness of test-time prompt tuning; however, these methods typically require per-image prompt adaptation during inference, which incurs high computational cost and limits scalability and practical deployment. To overcome this issue, we introduce Self-TPT, a novel framework leveraging Self-supervised learning for efficient Test-time Prompt Tuning. The key aspect of Self-TPT is that it turns to efficient predefined class adaptation via self-supervised learning, thus avoiding computation-heavy per-image adaptation at inference. Self-TPT begins by co-training the self-supervised and classification tasks using source data, then applies the self-supervised task exclusively for test-time new class adaptation. Specifically, we propose Contrastive Prompt Tuning (CPT) as the key task for self-supervision. CPT is designed to minimize intra-class distances while enhancing inter-class distinguishability via contrastive learning. Furthermore, empirical evidence suggests that CPT could closely mimic the back-propagated gradients of the classification task, offering a plausible explanation for its effectiveness. Motivated by this finding, we further introduce a gradient matching loss to explicitly enhance the gradient similarity. We evaluated Self-TPT across three challenging zero-shot benchmarks. The results consistently demonstrate that Self-TPT not only significantly reduces inference costs but also achieves state-of-the-art performance, effectively balancing the efficiency-efficacy trade-off.
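
As a rough illustration of the two losses named in the abstract, the Python sketch below implements a generic supervised-contrastive loss over text embeddings and a cosine-based gradient matching term. It is a minimal sketch under stated assumptions, not the paper's exact formulation: the function names, the temperature of 0.1, and the choice to treat embeddings that share a class label (for example, different prompt views of one class) as positives are all illustrative.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_prompt_loss(features, labels, temperature=0.1):
    """Supervised-contrastive style loss over text embeddings.

    features: list of embedding vectors, one per (class, prompt view) pair.
    labels:   class index of each embedding.
    Embeddings sharing a label are positives; all others act as negatives.
    (Hypothetical helper; the temperature value is an assumption.)
    """
    feats = [normalize(f) for f in features]
    n = len(feats)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Scaled similarities to every other embedding (softmax logits).
        logits = {j: dot(feats[i], feats[j]) / temperature
                  for j in range(n) if j != i}
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Mean negative log-probability of selecting a positive.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def gradient_matching_loss(grad_a, grad_b):
    """1 - cosine similarity between two flattened gradient vectors.

    grad_a and grad_b stand in for the classification-loss gradient and the
    self-supervised-loss gradient; how they are obtained is left out here.
    """
    return 1.0 - dot(normalize(grad_a), normalize(grad_b))
```

With this form, embeddings that cluster by class yield a small contrastive loss, and the matching term reaches zero when the two gradients point in the same direction, which is the behaviour the abstract describes qualitatively.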

Paper Structure (16 sections, 12 equations, 6 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Method
Preliminaries
Pipeline of Self-TPT
Contrastive Prompt Tuning
Gradient Matching
Experiments
Experimental Setup
Comparison with the State-of-the-Art Methods
Ablation Study
Conclusion
Additional Details
Additional Studies
Limitations and Future Work
...and 1 more section

Figures (6)

Figure 1: TPT versus Self-TPT. (a) TPT learns prompts from source data (stage 1), then adapts them to individual samples for prediction (stages 2&3). (b) Self-TPT employs text-oriented self-supervised learning (SSL) for joint training (stage 1) and for new class adaptation (stage 2), followed by direct predictions for each image (stage 3). (c) We present the frames per second (FPS) and graphics memory usage for each method when applied to CLIP-B/16 using the same A100-80G GPU. The y-axis represents the average cross-dataset accuracy.
Figure 2: Overview of Self-TPT. Self-TPT operates in three stages. Stage 1: Conduct prompt learning on a source dataset, co-trained with a self-supervised loss. Stage 2: Perform test-time adaptation (TTA) for new class understanding via the self-supervised loss. Stage 3: Perform direct predictions on the target dataset without further adjustment of the prompts.
Figure 3: Contrastive Prompt Tuning and Gradient Similarity Analysis.
Figure 4: Study on source data quality: more classes or more shots?
Figure 5: Study on Hyperparameter Sensitivity.
...and 1 more figure
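
The stage structure described in the captions of Figures 1 and 2 can be sketched as a short Python skeleton. This is only a structural illustration: `adapt_class_features` is a hypothetical stub that stands in for stage 2 and performs no real optimization, the feature vectors are toy values, and the update counts are bookkeeping rather than measured costs.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def adapt_class_features(class_features, steps):
    """Stage 2 placeholder: one adaptation pass per target class set.

    A real system would update prompt parameters with the self-supervised
    loss here. This hypothetical stub returns the features unchanged and
    only reports how many update steps were spent.
    """
    return class_features, steps


def predict(image_feature, class_features):
    """Stage 3: choose the class whose text feature is most similar."""
    scores = [cosine(image_feature, c) for c in class_features]
    return max(range(len(scores)), key=scores.__getitem__)


def run_class_level_adaptation(image_features, class_features, steps=10):
    """Adapt once for the class set, then predict with fixed class features."""
    adapted, updates = adapt_class_features(class_features, steps)
    preds = [predict(f, adapted) for f in image_features]
    return preds, updates  # updates do not grow with the number of images


def per_image_update_count(num_images, steps=1):
    """Update steps needed if prompts were instead re-tuned for every image."""
    return num_images * steps
```

The point of the sketch is the loop placement: adaptation sits outside the per-image loop, so its cost is paid once per target class set, whereas per-image tuning scales its update count with the size of the test set.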
