Bayesian Test-Time Adaptation for Vision-Language Models

Lihua Zhou, Mao Ye, Shuaifeng Li, Nianxin Li, Xiatian Zhu, Lei Deng, Hongbin Liu, Zhen Lei

TL;DR

The paper reframes CLIP-style zero-shot classification as Bayesian inference in order to study test-time adaptation for vision-language models. It introduces Bayesian Class Adaptation (BCA), which updates the likelihood by refining class embeddings and updates the prior from the posteriors of incoming samples, so that both adapt to distribution shifts. On the Cross Domain and OOD benchmarks, BCA outperforms state-of-the-art methods while keeping inference fast and memory demands modest. The results indicate that adapting the prior alongside the likelihood improves robustness when vision-language systems are deployed in the real world.

Abstract

Test-time adaptation with pre-trained vision-language models, such as CLIP, aims to adapt the model to new, potentially out-of-distribution test data. Existing methods calculate the similarity between the visual embedding and learnable class embeddings, which are initialized by text embeddings, for zero-shot image classification. In this work, we first analyze this process based on Bayes' theorem, and observe that the core factors influencing the final prediction are the likelihood and the prior. However, existing methods focus on adapting the class embeddings, and thereby the likelihood, while often ignoring the importance of the prior. To address this gap, we propose a novel approach, Bayesian Class Adaptation (BCA), which, in addition to continuously updating the class embeddings to adapt the likelihood, also uses the posterior of incoming samples to continuously update the prior for each class embedding. This dual updating mechanism allows the model to better adapt to distribution shifts and achieve higher prediction accuracy. Our method not only surpasses existing approaches in terms of performance metrics but also maintains superior inference rates and memory usage, making it highly efficient and practical for real-world applications.
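
One way to write out the likelihood/prior split described in the abstract is given below. This is a sketch that borrows the notation of the Figure 2 caption ($\bm{\mu}_m$ for the $m$-th class embedding, $P(\bm{U}|\bm{x}_i)$ for the probability over class embeddings, $P(Y|\bm{\mu}_m)$ for the per-embedding prior); the summation form and the indicator $\mathbb{1}[\cdot]$ are assumptions made for illustration and may differ from the paper's own equations.

```latex
% Posterior over classes: mix the per-embedding priors by the
% probability that image x_i belongs to each class embedding.
P(Y = k \mid \bm{x}_i)
  = \sum_{m=1}^{M} P(Y = k \mid \bm{\mu}_m)\, P(\bm{\mu}_m \mid \bm{x}_i)

% With the initial one-hot prior, P(Y = k | mu_m) = 1[k = m],
% the sum keeps only the m = k term:
P(Y = k \mid \bm{x}_i)
  = \sum_{m=1}^{M} \mathbb{1}[k = m]\, P(\bm{\mu}_m \mid \bm{x}_i)
  = P(\bm{\mu}_k \mid \bm{x}_i)
```

Under this reading, a prior that stays one-hot leaves only the similarity-based term, so any change in the prediction must come from the class embeddings. Once the prior is allowed to move away from one-hot, the same similarity scores can yield a different class posterior.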

Paper Structure

This paper contains 14 sections, 9 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: Fixed Prior vs. Adaptive Prior: Comparison of Diagnosis Outcomes. In the fixed prior scenario, patients with fever are consistently diagnosed with the common cold, regardless of whether it is a normal period or a COVID-19 period. In contrast, the adaptive prior scenario adjusts the diagnosis based on the current context. During normal periods, patients with fever are diagnosed with the common cold, while during the COVID-19 period, they are more likely to be diagnosed with COVID-19. This demonstrates the importance of performing prior adaptation in different environments.
  • Figure 2: Overview of the proposed Bayesian Class Adaptation (BCA) method. When deploying CLIP to a test environment, $M$ class embeddings are initialized based on hand-crafted prompts, and the prior for each class embedding is initialized as a one-hot vector with the corresponding class set to 1. (a) Embedding: when the $i$-th image arrives, it is encoded into a visual embedding $\bm{f}^v_i$ by the visual encoder. (b) Likelihood adaptation: the probability $P(\bm{U}|\bm{x}_i)$ is calculated from the current likelihood to find the class embedding $\bm{\mu}_s$ with the highest probability. This $\bm{\mu}_s$ is then updated with $\bm{f}^v_i$ using a statistical method to adapt the likelihood. (c) Prior adaptation: the posterior $P(Y|\bm{x}_i)$ is calculated by multiplying $P(\bm{U}|\bm{x}_i)$ by the current prior. The prior of the $s$-th class embedding, $P(Y|\bm{\mu}_s)$, is then adapted with this posterior.
  • Figure 3: Sensitivity analysis with respect to $\tau/\bm{n}_1/\bm{n}_2$ on ImageNet for the OOD benchmark and Aircraft for the Cross Domain benchmark, using ViT-B/16 as the visual backbone.
  • Figure 4: Prior visualization on the OOD benchmark.
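
The three steps in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption only says the class embedding is updated "using a statistical method" and that the prior is "adapted with" the posterior, so the running-mean updates below are assumptions. The class name `BCASketch`, the per-embedding counter `counts`, and the softmax temperature `scale` are hypothetical, and the hyperparameters $\tau$, $\bm{n}_1$, $\bm{n}_2$ from Figure 3 are not modelled.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


class BCASketch:
    """Hypothetical sketch of the dual update described in Figure 2."""

    def __init__(self, text_embeddings, scale=100.0):
        # M class embeddings initialised from (unit-normalised) text embeddings.
        norms = np.linalg.norm(text_embeddings, axis=1, keepdims=True)
        self.mu = text_embeddings / norms
        m = self.mu.shape[0]
        # Row m holds P(Y | mu_m); one-hot at initialisation.
        self.prior = np.eye(m)
        # Assumed bookkeeping: how many samples each embedding has absorbed.
        self.counts = np.ones(m)
        # Assumed softmax temperature (not a value taken from the paper).
        self.scale = scale

    def step(self, f_v):
        """Process one visual embedding and return P(Y | x_i)."""
        f_v = f_v / np.linalg.norm(f_v)
        # (b) probability over class embeddings, P(U | x_i).
        p_u = softmax(self.scale * (self.mu @ f_v))
        # (c) posterior: P(U | x_i) multiplied by the current prior.
        posterior = p_u @ self.prior
        # Most probable class embedding mu_s.
        s = int(np.argmax(p_u))
        n = self.counts[s]
        # Likelihood adaptation: running mean of mu_s (assumed update rule).
        mu_new = (n * self.mu[s] + f_v) / (n + 1.0)
        self.mu[s] = mu_new / np.linalg.norm(mu_new)
        # Prior adaptation: running mean of the prior row (assumed update rule).
        self.prior[s] = (n * self.prior[s] + posterior) / (n + 1.0)
        self.counts[s] = n + 1.0
        return posterior


# Toy usage with three classes and 3-D embeddings.
model = BCASketch(np.eye(3))
post = model.step(np.array([0.9, 0.1, 0.0]))
print(int(np.argmax(post)))
```

Because each prior row is averaged with a valid probability vector, every row remains a probability distribution after the update, which keeps the posterior normalised on later steps.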