Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization
Jameel Hassan, Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan
TL;DR
PromptAlign addresses zero-shot generalization under distribution shift for vision-language models by introducing test-time distribution alignment. It tunes multi-modal prompts so that the token distributions of a test input match offline source statistics, which are computed on a proxy dataset (commonly ImageNet), while the test statistics come from augmented views of a single test sample. The alignment loss is $\mathcal{L}_{\text{align}} = \frac{1}{L} \sum_{l=1}^{L} \left( \| \mu_l(T;p) - \hat{\mu}_l \|_1 + \| \sigma^2_l(T;p) - \hat{\sigma}^2_l \|_1 \right)$, where $\mu_l(T;p)$ and $\sigma^2_l(T;p)$ are the token mean and variance at layer $l$ for test sample $T$ under prompts $p$, and $\hat{\mu}_l$, $\hat{\sigma}^2_l$ are the corresponding source statistics. It is combined with the entropy objective as $\mathcal{L}_{\text{final}} = \mathcal{L}_{\text{entropy}} + \beta \mathcal{L}_{\text{align}}$, which updates prompts on both the image and text branches. Experiments on domain generalization and cross-dataset transfer show consistent improvements over baselines such as MaPLe and TPT, including a 3.08% average top-1 improvement over MaPLe in domain generalization and robust cross-dataset performance. The approach demonstrates that token distribution alignment can significantly narrow the train-test distribution gap for CLIP-like models, using a scalable proxy dataset and minimal runtime overhead.
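The two loss formulas above can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names (`token_stats`, `alignment_loss`, `marginal_entropy`, `final_loss`) are hypothetical, statistics are assumed to be pooled over all augmented views and tokens of a layer, and the entropy term is assumed to be the entropy of the prediction averaged over views. Only the forward computation is shown; the prompt update itself would need an autodiff framework.

```python
import numpy as np

def token_stats(tokens):
    """Per-dimension mean and variance of one layer's tokens.

    tokens: array of shape (views, n_tokens, dim).
    Assumption: statistics are pooled over all views and tokens.
    """
    flat = tokens.reshape(-1, tokens.shape[-1])
    return flat.mean(axis=0), flat.var(axis=0)

def alignment_loss(layer_tokens, source_means, source_vars):
    """L_align: L1 gap between test and source statistics, averaged over layers."""
    total = 0.0
    for tokens, mu_hat, var_hat in zip(layer_tokens, source_means, source_vars):
        mu, var = token_stats(tokens)
        total += np.abs(mu - mu_hat).sum() + np.abs(var - var_hat).sum()
    return total / len(layer_tokens)

def marginal_entropy(logits):
    """Entropy of the class distribution averaged over augmented views.

    logits: array of shape (views, classes).
    Assumption: this form of the entropy objective is illustrative.
    """
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    avg = probs.mean(axis=0)
    return float(-(avg * np.log(avg + 1e-12)).sum())

def final_loss(logits, layer_tokens, source_means, source_vars, beta):
    """L_final = L_entropy + beta * L_align."""
    return marginal_entropy(logits) + beta * alignment_loss(
        layer_tokens, source_means, source_vars
    )
```

When the test statistics already equal the source statistics, `alignment_loss` is zero and `final_loss` reduces to the entropy term, so `beta` only controls how strongly a statistics gap is penalized.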
Abstract
The promising zero-shot generalization of vision-language models such as CLIP has led to their adoption, via prompt learning, for numerous downstream tasks. Previous works have shown that test-time prompt tuning with entropy minimization can adapt text prompts to unseen domains. While effective, this overlooks the key cause of performance degradation on unseen domains: distribution shift. In this work, we explicitly handle this problem by aligning the out-of-distribution (OOD) test sample statistics to those of the source data using prompt tuning. We use a single test sample to adapt multi-modal prompts at test time by minimizing the feature distribution shift to bridge the gap in the test domain. On the domain generalization benchmark, our method improves zero-shot top-1 accuracy beyond existing prompt-learning techniques, with a 3.08% improvement over the baseline MaPLe. In cross-dataset generalization with unseen categories across 10 datasets, our method improves consistently across all datasets compared to the existing state-of-the-art. Our source code and models are available at https://jameelhassan.github.io/promptalign.
