Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models

Elaine Sui, Xiaohan Wang, Serena Yeung-Levy

TL;DR

This work addresses the degradation of zero-shot generalization in vision-language models (VLMs) under test-time domain shift. It proposes Test-Time Prototype Shifting (TPS), which learns a shift vector for each class prototype in the embedding space at test time while keeping the large encoders frozen. The prototypes are pre-computed with prompt-engineering strategies, optionally enriched with class descriptors, and cached. TPS achieves state-of-the-art performance on natural distribution shifts and cross-dataset generalization and extends to context-dependent visual reasoning, with significantly lower memory and compute requirements than text-prompt tuning methods. Because it operates on cached prototypes, the framework is plug-and-play with existing prompting strategies. Overall, TPS shows that small, targeted feature-space perturbations can bridge domain gaps for VLMs efficiently.

Abstract

Advancements in vision-language models (VLMs) have propelled the field of computer vision, particularly in the zero-shot learning setting. Despite their promise, the effectiveness of these models often diminishes due to domain shifts in test environments. To address this, we introduce the Test-Time Prototype Shifting (TPS) framework, a pioneering approach designed to adapt VLMs to test datasets using unlabeled test inputs. Our method is based on the notion of modulating per-class prototypes in the shared embedding space. By pre-computing and caching prototypes generated with the pre-trained text encoder, TPS not only facilitates optimization-free prototype reuse for subsequent predictions but also enables seamless integration with current advancements in prompt engineering. At test-time, TPS dynamically learns shift vectors for each prototype based solely on the given test sample, effectively bridging the domain gap and enhancing classification accuracy. A notable aspect of our framework is its significantly reduced memory and computational demands when compared to conventional text-prompt tuning methods. Extensive evaluations across 15 image classification datasets involving natural distribution shifts and cross-dataset generalization, as well as in context-dependent visual reasoning, demonstrate TPS's superior performance, achieving state-of-the-art results while reducing resource requirements.
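
The prototype pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the text and image embeddings are assumed to be available already as plain lists of floats (in practice they would come from the pre-trained encoders), and the helper names `class_prototype`, `shift_prototypes`, and `predict` are hypothetical. The form "add the shift, then re-normalize" is also an assumption. The shift vectors are taken as given here; learning them is a separate step.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def class_prototype(prompt_embeddings):
    """Average the normalized prompt embeddings of one class, then re-normalize.

    Computed once per class and cached, so the text encoder is not needed again.
    """
    normed = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def shift_prototypes(prototypes, shifts):
    """Add one shift vector per class and project back to unit length (assumed form)."""
    return [l2_normalize([p + s for p, s in zip(proto, shift)])
            for proto, shift in zip(prototypes, shifts)]

def predict(image_embedding, prototypes):
    """Return the index of the prototype with the highest cosine similarity."""
    img = l2_normalize(image_embedding)
    sims = [sum(a * b for a, b in zip(img, proto)) for proto in prototypes]
    return max(range(len(sims)), key=lambda c: sims[c])

# Toy 3-d embeddings standing in for text-encoder outputs (two prompts per class).
prompts_per_class = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],  # class 0
    [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],  # class 1
]
cached_prototypes = [class_prototype(p) for p in prompts_per_class]

# With all-zero shifts the shifted prototypes coincide with the cached ones.
zero_shifts = [[0.0, 0.0, 0.0] for _ in cached_prototypes]
shifted = shift_prototypes(cached_prototypes, zero_shifts)
predicted_class = predict([0.2, 0.9, 0.1], shifted)
```

Because only the cached prototypes and the shifts enter the prediction, any prompting strategy that yields per-class text embeddings can feed `class_prototype` without changing the rest of the sketch.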

Paper Structure

This paper contains 42 sections, 3 equations, 3 figures, 16 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison of Test-Time Prompt Tuning (TPT; Shu et al., 2022) against our method, Test-Time Prototype Shifting (TPS). TPT requires gradients to backpropagate through the large text encoder in order to reach the tunable prompt, incurring high memory and computational costs. In contrast, TPS only backpropagates gradients to the feature space, in which our class prototype shifts are learned, making it much more efficient.
  • Figure 2: We illustrate the three stages of Test-Time Prototype Shifting (TPS). 1) Prototype Generation: pre-computation of class prototypes using different prompt-engineering strategies. We show the computation of $k$ class-conditioned descriptors for a single class. Means are computed and cached. 2) Test-Time Shift Tuning: one iteration of test-time training where we tune the Shift Learner to generate small perturbations to the class prototypes to close the gap between the source and target distributions. Marginal entropy of the CLIP similarities of the shifted prototypes and augmented image embeddings is minimized. 3) Test-Time Inference: Using the tuned Shift Learner, we compute the final prediction for the shifted class prototypes and the original image embedding with CLIP similarity.
  • Figure 3: Comparison of computational and memory costs on ImageNet, measured on an A6000 GPU. Left: Average runtimes of TPT and TPS across different-sized subsets of ImageNet (Deng et al., 2009) over 3 runs. Error bars are plotted but are not visible because the standard deviations are extremely small. Right: Memory consumption of TPT and TPS on ImageNet.
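
The test-time shift tuning stage in the Figure 2 caption (minimizing the marginal entropy over augmented views by adjusting only the prototype shifts) can be illustrated with a toy sketch. Several things here are assumptions rather than details from the paper: the Shift Learner is modeled as plain per-class shift vectors, central finite differences stand in for backpropagation so the example needs no deep-learning library, the `scale`, `lr`, and `steps` values are placeholders, and no filtering of augmented views is applied. The function names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def marginal_entropy(shifts, prototypes, views, scale):
    """Entropy of the class distribution averaged over all augmented views."""
    shifted = [l2_normalize([p + s for p, s in zip(proto, shift)])
               for proto, shift in zip(prototypes, shifts)]
    avg = [0.0] * len(prototypes)
    for view in views:
        logits = [scale * sum(a * b for a, b in zip(view, proto)) for proto in shifted]
        avg = [a + p / len(views) for a, p in zip(avg, softmax(logits))]
    return -sum(p * math.log(p + 1e-12) for p in avg)

def tune_shifts(prototypes, views, scale=10.0, lr=0.01, steps=1, eps=1e-4):
    """Gradient descent on the shifts only; prototypes and view embeddings stay fixed."""
    shifts = [[0.0] * len(proto) for proto in prototypes]
    for _ in range(steps):
        grads = [[0.0] * len(proto) for proto in prototypes]
        for c in range(len(shifts)):
            for i in range(len(shifts[c])):
                original = shifts[c][i]
                shifts[c][i] = original + eps
                up = marginal_entropy(shifts, prototypes, views, scale)
                shifts[c][i] = original - eps
                down = marginal_entropy(shifts, prototypes, views, scale)
                shifts[c][i] = original
                grads[c][i] = (up - down) / (2 * eps)  # central difference
        shifts = [[s - lr * g for s, g in zip(shift, grad)]
                  for shift, grad in zip(shifts, grads)]
    return shifts

# Toy data: two cached class prototypes and three augmented-view embeddings.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
views = [l2_normalize(v) for v in ([0.8, 0.6, 0.1], [0.7, 0.7, 0.0], [0.9, 0.4, 0.2])]

no_shift = [[0.0] * 3 for _ in prototypes]
entropy_before = marginal_entropy(no_shift, prototypes, views, scale=10.0)
tuned = tune_shifts(prototypes, views)
entropy_after = marginal_entropy(tuned, prototypes, views, scale=10.0)
```

The only optimized quantities are the entries of `shifts`, which mirrors the point of Figure 1: no gradient has to flow through either encoder. For the inference stage, the tuned shifts would be applied to the prototypes and compared against the original (unaugmented) image embedding.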