In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model
Junhui Yin, Xinyu Zhang, Lin Wu, Xiaojie Wang
TL;DR
The paper addresses distribution shift in vision-language models by adapting a frozen CLIP at test time with visual prompts that are learned per test sample and guided by a few labeled in-context examples. It introduces In-Context Prompt Learning (InCPL), which uses a language-to-vision translator (Token-Net) to generate visual prompts conditioned on a small set of in-context examples, together with a context-aware loss that combines an unsupervised component and a supervised component. A cyclic learning strategy alternates between refining the visual and textual prompts to promote cross-modal alignment while leaving the CLIP weights unchanged. Experiments on fine-grained and distribution-shift benchmarks show InCPL achieving state-of-the-art or competitive results, and ablations support the importance of context information, language-aware prompting, and the cyclic optimization. The approach offers a practical way to deploy robust vision-language models across diverse tasks with minimal labeled data and without full model fine-tuning.
Abstract
Current pre-trained vision-language models, such as CLIP, have demonstrated remarkable zero-shot generalization capabilities across various downstream tasks. However, their performance degrades significantly when test inputs follow a different distribution. In this paper, we explore test-time prompt tuning (TTPT), which adapts the CLIP model to novel downstream tasks through a one-step unsupervised optimization that involves only test samples. Inspired by in-context learning in natural language processing (NLP), we propose In-Context Prompt Learning (InCPL) for test-time visual recognition tasks, which equips a pre-trained vision-language model with labeled examples as context information on the downstream task. Specifically, InCPL associates a new test sample with very few labeled examples (sometimes just one) as context information, enabling reliable label estimation for the test sample and facilitating model adaptation. To achieve this, InCPL employs an efficient language-to-vision translator to exploit textual prior information for visual prompt learning. Further, we introduce a context-aware unsupervised loss to optimize visual prompts tailored to test samples. Finally, we design a cyclic learning strategy for visual and textual prompts to ensure mutual synergy across the two modalities. This enables a pre-trained, frozen CLIP model to adapt to any task using its learned adaptive prompt. Our method demonstrates superior performance and achieves state-of-the-art results across various downstream datasets.
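The idea of pairing a test sample with a few labeled in-context examples can be illustrated with a toy loss computation. The following is a minimal sketch, not the authors' implementation. It assumes the unsupervised term is the entropy of the test-sample prediction and the supervised term is the cross-entropy on the in-context examples. The function names, the `weight` parameter, and the use of plain logit lists (standing in for frozen CLIP image-text similarity scores) are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy of a predicted class distribution (no label needed)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class for one labeled example."""
    return -math.log(probs[label])


def context_aware_loss(test_logits, context_logits, context_labels, weight=1.0):
    """Assumed form: entropy on the test sample plus weighted mean
    cross-entropy on the labeled in-context examples."""
    unsupervised = entropy(softmax(test_logits))
    supervised = sum(
        cross_entropy(softmax(logits), label)
        for logits, label in zip(context_logits, context_labels)
    ) / len(context_labels)
    return unsupervised + weight * supervised


# Three classes, one in-context example, completely uninformative scores:
# both terms equal ln(3), so the total is 2 * ln(3), about 2.197.
uniform_loss = context_aware_loss([0.0, 0.0, 0.0], [[0.0, 0.0, 0.0]], [0])

# Confident and correct scores drive both terms toward zero.
confident_loss = context_aware_loss([10.0, 0.0, 0.0], [[10.0, 0.0, 0.0]], [0])
```

In a test-time setting of the kind described above, the gradient of such a loss would be taken with respect to the prompt tokens only, so the CLIP weights stay frozen.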
