In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model

Junhui Yin, Xinyu Zhang, Lin Wu, Xiaojie Wang

TL;DR

The paper tackles distribution shifts in vision-language models by enabling test-time adaptation of a frozen CLIP through per-test-sample visual prompts guided by in-context labels. It introduces In-Context Prompt Learning (InCPL), which employs a language-to-vision translator (Token-Net) to generate visual prompts conditioned on a small set of in-context examples and a context-aware loss that combines unsupervised and supervised components. A cyclic learning strategy alternates between refining visual and textual prompts to promote cross-modal alignment while preserving CLIP weights. Extensive experiments on fine-grained and distribution-shift benchmarks show InCPL achieving state-of-the-art or competitive results, with strong ablations validating the importance of context information, language-aware prompting, and the cyclic optimization. This approach offers a practical pathway to deploy robust vision-language models across diverse tasks with minimal labeled data and without full model fine-tuning.

Abstract

Current pre-trained vision-language models, such as CLIP, have demonstrated remarkable zero-shot generalization capabilities across various downstream tasks. However, their performance degrades significantly when test inputs follow a different distribution. In this paper, we explore test-time prompt tuning (TTPT), which adapts the CLIP model to novel downstream tasks through a one-step unsupervised optimization that involves only test samples. Inspired by in-context learning in natural language processing (NLP), we propose In-Context Prompt Learning (InCPL) for test-time visual recognition tasks, which supplies a pre-trained vision-language model with labeled examples as context information for the downstream task. Specifically, InCPL associates a new test sample with very few labeled examples (sometimes just one) as context information, enabling reliable label estimation for the test sample and facilitating model adaptation. To achieve this, InCPL employs an efficient language-to-vision translator to explore the textual prior information for visual prompt learning. Further, we introduce a context-aware unsupervised loss to optimize visual prompts tailored to test samples. Finally, we design a cyclic learning strategy for visual and textual prompts to ensure mutual synergy across different modalities. This enables a pre-trained, frozen CLIP model to adapt to any task using its learned adaptive prompt. Our method demonstrates superior performance and achieves state-of-the-art results across various downstream datasets.
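
To make the language-to-vision translator more concrete, the following is a minimal sketch of how text token embeddings could be mapped to visual prompt tokens and placed in front of the image patch tokens. It assumes the translator is a single linear layer, which this page does not confirm; the names `token_net` and `prepend_prompts`, the weight layout, and the toy dimensions are all hypothetical and chosen for illustration only.

```python
# Hypothetical sketch: a linear "token net" that projects text tokens
# (dimension d_t) into visual prompt tokens (dimension d_v).
# The single-linear-layer design is an assumption, not the paper's stated architecture.

def token_net(text_tokens, weight, bias):
    """Project each text token to one visual prompt token.

    text_tokens: list of vectors, each of length d_t
    weight:      d_v rows, each of length d_t
    bias:        vector of length d_v
    """
    prompts = []
    for tok in text_tokens:
        prompts.append(
            [sum(w * t for w, t in zip(row, tok)) + b for row, b in zip(weight, bias)]
        )
    return prompts


def prepend_prompts(visual_prompts, patch_tokens):
    """Place prompt tokens before the patch tokens fed to a frozen vision encoder."""
    return list(visual_prompts) + list(patch_tokens)


# Toy usage: one text token with d_t = 2, projected to d_v = 3.
text_tokens = [[1.0, 2.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
visual_prompts = token_net(text_tokens, weight, bias)
patch_tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
encoder_input = prepend_prompts(visual_prompts, patch_tokens)
```

In this sketch only `weight` and `bias` would be trainable at test time; the encoder that consumes `encoder_input` stays frozen, in line with the abstract's description of a frozen CLIP model.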

Paper Structure

This paper contains 16 sections, 3 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Illustration of the proposed in-context prompt learning (InCPL) framework. (a) We employ a token network to convert language descriptions into visual prompts, which serve as inputs to the vision encoder of the CLIP model. (b) We construct a test sample coupled with in-context examples and further introduce a context-aware unsupervised loss to optimize visual prompts tailored to test samples. (c) We design a cyclic learning strategy to seamlessly integrate the visual prompt with the text prompt, which effectively retrieves the pre-trained knowledge relevant to test data across different modalities.
  • Figure 2: The comparison between our language-aware visual prompt approach and the patched, padded, and token-based visual prompt methods.
  • Figure 3: Illustration of the token net.
  • Figure 4: Illustration of the proposed visual in-context prompt learning for test-time visual recognition. Each in-context example $(x_i, y_i)$, the test sample $x_t$, and its prefix text are fed into a token encoder to obtain visual, prefix, and text tokens. The text tokens are translated into a visual prompt using a token network. We optimize the visual tokens using a context-aware objective: a supervised cross-entropy term $L(x_i, y_i, \bm{P}_{\text{v}})$ involving in-context examples and an unsupervised entropy minimization term $L(x_t, \bm{P}_{\text{v}})$ with the test sample.
  • Figure 5: Experimental results w.r.t. a varying number of in-context examples.
  • ...and 3 more figures
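
The context-aware objective described in the Figure 4 caption pairs an entropy term on the test sample with a cross-entropy term on the in-context examples. The sketch below computes both from raw class logits in plain Python. How the two terms are combined is an assumption here: the code averages the supervised term over the in-context examples and adds it with a hypothetical `weight` factor, and the function names are illustrative rather than taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of class logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def entropy(logits):
    """Entropy of the predicted class distribution (unsupervised term for x_t)."""
    log_probs = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in log_probs)


def cross_entropy(logits, label):
    """Negative log-likelihood of the true label (supervised term for (x_i, y_i))."""
    return -log_softmax(logits)[label]


def context_aware_loss(test_logits, context_logits, context_labels, weight=1.0):
    """Entropy on the test sample plus mean cross-entropy on in-context examples.

    The unweighted sum with an optional `weight` is an assumption of this sketch.
    """
    unsupervised = entropy(test_logits)
    if not context_labels:
        return unsupervised
    supervised = sum(
        cross_entropy(logits, label)
        for logits, label in zip(context_logits, context_labels)
    ) / len(context_labels)
    return unsupervised + weight * supervised


# Toy usage with 3 classes and a single in-context example.
uncertain = context_aware_loss([0.0, 0.0, 0.0], [[0.0, 0.0, 0.0]], [0])
confident = context_aware_loss([10.0, 0.0, 0.0], [[10.0, 0.0, 0.0]], [0])
```

With uniform logits both terms equal log 3, so `uncertain` is 2 log 3; a confident, correct prediction drives both terms toward zero, which is the direction a gradient step on the visual prompt would push.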