Test-time Loss Landscape Adaptation for Zero-Shot Generalization in Vision-Language Models
Aodi Li, Liansheng Zhuang, Xiao Long, Minghong Yao, Shafei Wang
TL;DR
This work tackles zero-shot generalization under distribution shift for vision-language models by rethinking test-time adaptation through loss landscapes. It introduces Test-time Loss Landscape Adaptation (TLLA), a two-stage approach that first uses Sharpness-Aware Prompt Tuning (SAPT) to locate a flat minimum of the training loss, then applies Sharpness-based Test Sample Selection (STSS) to pick augmented test views whose loss landscapes align with that minimum, avoiding test-time backpropagation. Empirical results on domain generalization and cross-dataset benchmarks show state-of-the-art performance with substantial reductions in computation compared to prompt-tuning baselines; for example, TLLA surpasses TPT by an average of $5.32$ and $6.98$ percentage points on four ImageNet variant datasets with ResNet50 and ViT-B/16 image encoders, respectively. Theoretical analysis provides a generalization bound framed by the loss-landscape distance between training and test distributions, supporting the intuition that alignment of flat minima yields more reliable predictions, especially for data closer to the training distribution.
Abstract
Test-time adaptation of pre-trained vision-language models has emerged as a technique for tackling distribution shifts at test time. Although existing methods, especially those based on Test-time Prompt Tuning (TPT), have shown promising results, the high computational cost of their parameter optimization limits scalability and practical application. This paper shows, from a loss landscape perspective, that backpropagation in existing methods is unnecessary. Building on this insight, it proposes a simple yet effective framework called Test-time Loss Landscape Adaptation (TLLA). TLLA leverages the relative position between the training minimum and the test loss landscapes to guide the adaptation process, avoiding any update of model parameters at test time. Specifically, it consists of two stages. In the prompt tuning stage, a Sharpness-Aware Prompt Tuning (SAPT) method is introduced to identify a flat minimum of the training loss, laying the foundation for the subsequent test-time adaptation. In the test stage, a Sharpness-based Test Sample Selection (STSS) approach is used to ensure that the flat minimum of the training loss landscape aligns with that of each selected augmented test sample's loss landscape. Extensive experiments on both domain generalization and cross-dataset benchmarks demonstrate that TLLA achieves state-of-the-art performance while significantly reducing computational overhead. Notably, TLLA surpasses TPT by an average of 5.32\% and 6.98\% on four ImageNet variant datasets when employing ResNet50 and ViT-B/16 image encoders, respectively. The code will be available soon.
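To make the test stage concrete, the following is a minimal sketch of sharpness-based selection over augmented views using forward passes only. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the sharpness measure, the loss, or the selection rule. Here a linear classifier `W` stands in for the tuned prompt at the training minimum, prediction entropy stands in for a label-free test loss, sharpness is approximated by the worst loss increase over a few random perturbations of norm `rho`, and the names `sharpness`, `select_and_predict`, `keep`, `rho`, and `n_dirs` are all hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_loss(W, x):
    # Label-free loss (assumption): entropy of the prediction for one view x.
    p = softmax(W @ x)
    return float(-(p * np.log(p + 1e-12)).sum())

def sharpness(W, x, directions):
    # Forward-only proxy (assumption): worst loss increase over fixed
    # perturbation directions around the training minimum W.
    base = entropy_loss(W, x)
    return max(0.0, max(entropy_loss(W + eps, x) - base for eps in directions))

def select_and_predict(W, views, keep=0.25, rho=0.05, n_dirs=8, seed=0):
    # views: (n_views, d) features of augmented copies of one test image.
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((n_dirs,) + W.shape)
    # Rescale every direction to norm rho so all views are probed identically.
    norms = np.linalg.norm(directions.reshape(n_dirs, -1), axis=1)
    directions *= (rho / norms)[:, None, None]
    scores = np.array([sharpness(W, x, directions) for x in views])
    k = max(1, int(round(keep * len(views))))
    chosen = np.argsort(scores)[:k]  # flattest views first
    # Average predictions over the selected views; W is never updated.
    probs = softmax(views[chosen] @ W.T).mean(axis=0)
    return chosen, probs
```

Calling `select_and_predict(W, views)` returns the indices of the retained views and the averaged class probabilities. No gradient is computed at any point, which is the property the abstract emphasizes; the random-direction probe is only one of several possible forward-only sharpness estimates.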
