Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models
Elaine Sui, Xiaohan Wang, Serena Yeung-Levy
TL;DR
This work tackles the degraded zero-shot generalization of vision-language models (VLMs) under domain shift at test time. It proposes Test-Time Prototype Shifting (TPS), which modulates per-class prototypes directly in the embedding space by learning class-specific shift vectors at test time while keeping the large encoders frozen. TPS starts from pre-computed, cached prototypes generated via prompt engineering and optionally enriched with descriptors. It achieves state-of-the-art performance on natural distribution shifts and cross-dataset generalization, extends to context-dependent visual reasoning, and requires significantly less memory and compute than text-prompt tuning methods. The framework is plug-and-play with existing prompting strategies, and its efficiency makes it practical for real-world deployment. Overall, TPS shows that small, targeted feature-space shifts can bridge domain gaps for VLMs, enabling faster and more scalable zero-shot generalization.
Abstract
Advancements in vision-language models (VLMs) have propelled the field of computer vision, particularly in the zero-shot learning setting. Despite their promise, the effectiveness of these models often diminishes due to domain shifts in test environments. To address this, we introduce the Test-Time Prototype Shifting (TPS) framework, a pioneering approach designed to adapt VLMs to test datasets using unlabeled test inputs. Our method is based on the notion of modulating per-class prototypes in the shared embedding space. By pre-computing and caching prototypes generated with the pre-trained text encoder, TPS not only facilitates optimization-free prototype reuse for subsequent predictions but also enables seamless integration with current advancements in prompt engineering. At test time, TPS dynamically learns shift vectors for each prototype based solely on the given test sample, effectively bridging the domain gap and enhancing classification accuracy. A notable aspect of our framework is its significantly reduced memory and computational demands when compared to conventional text-prompt tuning methods. Extensive evaluations across 15 image classification datasets involving natural distribution shifts and cross-dataset generalization, as well as in context-dependent visual reasoning, demonstrate TPS's superior performance, achieving state-of-the-art results while reducing resource requirements.
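The prototype-shifting idea described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not state: the objective (entropy of the prediction averaged over augmented views of one test image), the re-normalization of shifted prototypes, the temperature `tau`, the plain gradient-descent update, and the convention that the first row of `views` is the un-augmented image. The function names `marginal_entropy` and `tps_predict` are hypothetical. The cached prototypes and the image features are treated as fixed inputs, standing in for the outputs of the frozen encoders; only the per-class shift vectors are updated.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row of x to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def marginal_entropy(prototypes, shifts, views, tau):
    """Return the entropy of the view-averaged prediction and its gradient w.r.t. shifts.

    prototypes: (C, D) cached class prototypes (fixed).
    shifts:     (C, D) learnable per-class shift vectors.
    views:      (V, D) unit-norm features of augmented views of one test image.
    """
    w = prototypes + shifts                          # shifted prototypes
    norms = np.linalg.norm(w, axis=-1, keepdims=True)
    u = w / norms                                    # re-normalized (assumption)
    logits = tau * views @ u.T                       # (V, C) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)        # per-view softmax
    p_bar = probs.mean(axis=0)                       # prediction averaged over views
    entropy = -np.sum(p_bar * np.log(p_bar))

    # Manual backpropagation: entropy -> p_bar -> logits -> u -> shifts.
    g = -(np.log(p_bar) + 1.0)                       # dH / dp_bar
    d_logits = probs * (g[None, :] - (probs @ g)[:, None]) / views.shape[0]
    d_u = tau * d_logits.T @ views                   # (C, D)
    d_w = (d_u - np.sum(d_u * u, axis=1, keepdims=True) * u) / norms
    return entropy, d_w


def tps_predict(prototypes, views, steps=1, lr=1e-3, tau=10.0):
    """Learn shift vectors for one test sample, then classify its first view."""
    shifts = np.zeros_like(prototypes)               # start from the cached prototypes
    for _ in range(steps):
        _, grad = marginal_entropy(prototypes, shifts, views, tau)
        shifts -= lr * grad                          # encoders are never touched
    shifted = l2_normalize(prototypes + shifts)
    scores = views[0] @ shifted.T                    # views[0]: original image (assumption)
    return int(np.argmax(scores)), shifts
```

In this sketch the only trainable state is a `(C, D)` array, so no gradient flows through either encoder. That is consistent with the abstract's claim of reduced memory and compute relative to text-prompt tuning, although the sketch itself measures nothing.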
