Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models

Lihua Zhou, Mao Ye, Shuaifeng Li, Nianxin Li, Jinlin Wu, Xiatian Zhu, Lei Deng, Hongbin Liu, Jiebo Luo, Zhen Lei

TL;DR

The paper addresses the brittleness of vision-language models under distribution shift by introducing Bayesian Class Adaptation plus (BCA+), a training-free test-time adaptation framework for both object recognition and open-vocabulary object detection. BCA+ maintains a dynamic cache that stores and iteratively updates class embeddings, spatial scales, and priors, and it formulates adaptation as Bayesian inference that fuses the original VLM output with a cache-based prediction. A dual-adaptation mechanism, combining likelihood adaptation (via feature embeddings and scales) with prior adaptation (via evolving class distributions) and coupled with uncertainty-guided fusion, yields robust, real-time predictions without backpropagation. Experiments across OOD, cross-domain, and corrupted benchmarks report state-of-the-art performance at high efficiency. The approach also extends prior work by being the first to apply TTA to open-vocabulary detection with Grounding DINO, offering a unified, training-free solution across recognition and detection.

Abstract

Vision-language models (VLMs) such as CLIP and Grounding DINO have achieved remarkable success in object recognition and detection. However, their performance often degrades under real-world distribution shifts. Test-time adaptation (TTA) aims to mitigate this issue by adapting models during inference. Existing methods either rely on computationally expensive backpropagation, which hinders real-time deployment, or focus solely on likelihood adaptation, which overlooks the critical role of the prior. Our prior work, Bayesian Class Adaptation (BCA), addressed these shortcomings for object recognition by introducing a training-free framework that incorporates adaptive priors. Building upon this foundation, we now present Bayesian Class Adaptation plus (BCA+), a unified, training-free framework for TTA for both object recognition and detection. BCA+ introduces a dynamic cache that adaptively stores and updates class embeddings, spatial scales (for detection), and, crucially, adaptive class priors derived from historical predictions. We formulate adaptation as a Bayesian inference problem, where final predictions are generated by fusing the initial VLM output with a cache-based prediction. This cache-based prediction combines a dynamically updated likelihood (measuring feature and scale similarity) and a prior (reflecting the evolving class distribution). This dual-adaptation mechanism, coupled with uncertainty-guided fusion, enables BCA+ to correct both the model's semantic understanding and its contextual confidence. As a training-free method requiring no backpropagation, BCA+ is highly efficient. Extensive experiments demonstrate that BCA+ achieves state-of-the-art performance on both recognition and detection benchmarks.
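
The Bayesian fusion described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the likelihood is taken to be a softmax over cosine similarities between the test feature and the cached embeddings, each cache entry is assumed to carry a class-prior vector, and the uncertainty-guided fusion is approximated by weighting each prediction by exp(-entropy). The function names (`cache_prediction`, `fuse`) and the `temperature` parameter are hypothetical, and the detection-specific scale similarity is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def cache_prediction(feature, cache_embeddings, cache_priors, temperature=0.1):
    """Assumed form: p_cache(y) = sum_m likelihood_m * prior_m(y)."""
    # Likelihood over cache entries from feature similarity (assumption).
    likelihood = softmax(
        [cosine(feature, emb) / temperature for emb in cache_embeddings]
    )
    num_classes = len(cache_priors[0])
    # Mix the per-entry class priors, weighted by the likelihood.
    return [
        sum(lik * prior[c] for lik, prior in zip(likelihood, cache_priors))
        for c in range(num_classes)
    ]


def fuse(p_init, p_cache):
    """Hypothetical uncertainty-guided fusion: lower entropy, higher weight."""
    w_init = math.exp(-entropy(p_init))
    w_cache = math.exp(-entropy(p_cache))
    total = w_init + w_cache
    return [(w_init * a + w_cache * b) / total for a, b in zip(p_init, p_cache)]


# Toy two-class example with two cache entries.
feature = [1.0, 0.0]
cache_embeddings = [[1.0, 0.0], [0.0, 1.0]]
cache_priors = [[0.9, 0.1], [0.2, 0.8]]
p_init = [0.45, 0.55]  # initial VLM output, slightly favouring class 1

p_cache = cache_prediction(feature, cache_embeddings, cache_priors)
p_final = fuse(p_init, p_cache)
print("cache-based:", p_cache)
print("fused:", p_final)
```

In this toy setup the initial prediction slightly favours class 1, while the cache entry closest to the feature holds a prior concentrated on class 0. Because the cache-based prediction has lower entropy, it receives the larger weight and the fused prediction favours class 0.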

Paper Structure

This paper contains 16 sections, 17 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustrative example of Fixed Prior vs. Adaptive Prior: Comparison of Diagnosis Outcomes. In the fixed prior scenario, patients with fever are consistently diagnosed with the common cold, regardless of whether it is a normal period or a COVID-19 period. In contrast, the adaptive prior scenario adjusts the diagnosis based on the current context. During normal periods, patients with fever are diagnosed with the common cold, while during the COVID-19 period, they are more likely to be diagnosed with COVID-19. This demonstrates the importance of performing prior adaptation in different environments.
  • Figure 2: Traditional test-time detection vs. VLM-based test-time detection. Traditional test-time detection requires training a dedicated detector (e.g., Faster R-CNN) for each specific domain. For example, a model trained on the "Cat" domain cannot be directly applied to the "Dog" domain and must be retrained from scratch, which is resource-intensive. Even during test-time adaptation, these methods rely on backpropagation for updates. In contrast, the VLM-based approach uses a single VLM (e.g., Grounding DINO) as a universal baseline, enabling detection for any category (e.g., both "Cat" and "Dog") without retraining. Our BCA+ framework enables training-free adaptation via a dynamic cache, eliminating backpropagation for real-time deployment.
  • Figure 3: Overview of the proposed BCA+ framework. The process is divided into three main stages: (a) VLM Inference, where the pre-trained vision-language model (CLIP for recognition or Grounding DINO for detection) processes the input image $\bm{x}_i$ to generate initial outputs, including visual embeddings $\{\bm{f}_{ij}^v\}_j$, initial class predictions $\{\bm{p}^{init}_{ij}\}_j$, and bounding boxes $\{\bm{b}_{ij}\}_j$ (for detection). (b) Bayesian Inference, which computes a cache-based prediction $\{\bm{p}^{cache}_{ij}\}_j$ by combining a likelihood (derived from feature and scale similarity with cached entries) and a dynamically updated prior. The final prediction $\{\bm{p}^{final}_{ij}\}_j$ is produced by fusing the initial and cache-based predictions with an uncertainty-guided strategy. (c) Cache Adaptation, where the cache is updated based on the final prediction: likelihood adaptation refines the cached feature embeddings $\{\bm{f}^{cache}_m\}_m$ and scales $\{\bm{b}^{cache}_m\}_m$, while prior adaptation updates the cached priors $\{\bm{v}^{cache}_m\}_m$. Orange lines denote components exclusive to object recognition, red lines denote components exclusive to object detection, and black lines denote components shared by both tasks.
  • Figure 4: Hyperparameter sensitivity analysis and cache dynamics analysis. (a) and (d) show the sensitivity of object recognition performance to the confidence threshold $\tau_1$ and the similarity threshold $\tau_2$ on the OOD and Cross-Domain benchmarks, respectively. (b) and (e) show the sensitivity of object detection performance to $\tau_1$ and $\tau_2$ on the PASCAL-C and COCO-C datasets, respectively. (c) shows the sensitivity of detection performance to the balance weight $w_s$ on the PASCAL-C and COCO-C datasets. (f) shows the number of cached entries $M$ over the image sequence on the COCO-C-Brit dataset.
  • Figure 5: Prior visualization on OOD benchmark.
  • ...and 1 more figure
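
The cache adaptation stage in Figure 3(c), together with the thresholds $\tau_1$ and $\tau_2$ and the cache size $M$ from Figure 4, can be sketched as follows for the recognition case. The captions do not specify the update rule, so the rule below is hypothetical: a sample is used only if its final confidence reaches `tau1`, it is merged into the most similar cache entry by a running mean if the cosine similarity reaches `tau2`, and otherwise it opens a new entry. The `CacheEntry` class, the `update_cache` function, and the default threshold values are illustrative assumptions, and the detection-specific scales and the weight $w_s$ are left out.

```python
import math
from dataclasses import dataclass


@dataclass
class CacheEntry:
    embedding: list  # running mean of visual embeddings (likelihood side)
    prior: list      # running mean of final predictions (prior side)
    count: int = 1   # number of samples merged into this entry


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def update_cache(cache, feature, p_final, tau1=0.6, tau2=0.8):
    """Hypothetical update rule; mutates and returns the cache list."""
    # Skip low-confidence predictions so they do not pollute the cache.
    if max(p_final) < tau1:
        return cache

    # Find the cached entry most similar to the new feature.
    best, best_sim = None, float("-inf")
    for entry in cache:
        sim = cosine(feature, entry.embedding)
        if sim > best_sim:
            best, best_sim = entry, sim

    # No sufficiently similar entry: start a new one (M grows by one).
    if best is None or best_sim < tau2:
        cache.append(CacheEntry(list(feature), list(p_final)))
        return cache

    # Likelihood adaptation: running mean of the embedding.
    n = best.count
    best.embedding = [(n * e + f) / (n + 1) for e, f in zip(best.embedding, feature)]
    # Prior adaptation: running mean of the class distribution.
    best.prior = [(n * v + p) / (n + 1) for v, p in zip(best.prior, p_final)]
    best.count = n + 1
    return cache


cache = []
update_cache(cache, [1.0, 0.0], [0.9, 0.1])  # creates the first entry
update_cache(cache, [0.0, 1.0], [0.2, 0.8])  # dissimilar: second entry
update_cache(cache, [1.0, 0.0], [0.7, 0.3])  # similar: merged into the first
update_cache(cache, [1.0, 0.1], [0.5, 0.5])  # below tau1: ignored
print(len(cache), cache[0])
```

Under this assumed rule the cache grows only when a confident sample is dissimilar to every stored entry, which is one way the number of entries $M$ could evolve over an image sequence.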