Ultra-Light Test-Time Adaptation for Vision–Language Models

Byunghyun Kim

TL;DR

Under domain shift, vision–language models (VLMs) suffer from feature drift, label-prior mismatch, and miscalibration. UL-TTA is a training-free, backprop-free method that freezes the backbone and updates only logit-level parameters: class prototypes $t_c$, class priors $\pi_c$, and temperature $\tau$. It does so with an online Bayesian head that uses selective evidence and simple guards. The head is updated with MAP estimates anchored to text prompts and a Dirichlet prior, and the prediction and calibration temperatures are decoupled, which gives stable single-pass adaptation across diverse cross-domain and OOD benchmarks (~726K test samples). The reported gains are +4.7 top-1 points over zero-shot CLIP on average and a 20-30% reduction in ECE, at less than 8% latency overhead. The results indicate that logit-level Bayesian adaptation suffices for favorable accuracy–calibration trade-offs, which makes ultra-light TTA practical for real-time systems. Open-set extensions and meta-learned gating are left to future work.

Abstract

Vision-Language Models (VLMs) such as CLIP achieve strong zero-shot recognition by comparing image embeddings to text-derived class prototypes. However, under domain shift, they suffer from feature drift, class-prior mismatch, and severe miscalibration. Existing test-time adaptation (TTA) methods often require backpropagation through large backbones, covariance estimation, or heavy memory/state, which is problematic for streaming and edge scenarios. We propose Ultra-Light Test-Time Adaptation (UL-TTA), a fully training-free and backprop-free framework that freezes the backbone and adapts only logit-level parameters: class prototypes, class priors, and temperature. UL-TTA performs an online EM-style procedure with (i) selective sample filtering to use only confident predictions, (ii) closed-form Bayesian updates for prototypes and priors anchored by text and Dirichlet priors, (iii) decoupled temperatures for prediction vs. calibration, and (iv) lightweight guards (norm clipping, prior KL constraints, smoothed temperature) to prevent drift in long streams. Across large-scale cross-domain and OOD benchmarks (PACS, Office-Home, DomainNet, Terra Incognita, ImageNet-R/A/V2/Sketch; ~726K test samples) and strong TTA baselines including Tent, T3A, CoTTA, SAR, Tip-Adapter, and FreeTTA, UL-TTA consistently improves top-1 accuracy (e.g., +4.7 points over zero-shot CLIP on average) while reducing ECE by 20-30%, with less than 8% latency overhead. Long-stream experiments up to 200K samples show no collapse. Our results demonstrate that logit-level Bayesian adaptation is sufficient to obtain state-of-the-art accuracy-calibration trade-offs for VLMs under domain shift, without updating any backbone parameters.
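
The abstract does not give the update equations, so the following is only a minimal sketch of what a logit-level head of this kind could look like, not the authors' implementation. It assumes a pseudo-count-weighted mean for the prototypes (text anchor weighted by `kappa`), a symmetric Dirichlet posterior mean for the priors (concentration `alpha`), a fixed confidence threshold for sample selection, and a norm clip on the prototype shift. The class name `LogitHeadSketch` and every hyperparameter value are hypothetical. The prior KL constraint and the temperature smoothing mentioned above are left out.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

class LogitHeadSketch:
    """Hypothetical logit-level head; all defaults are illustrative assumptions."""

    def __init__(self, text_protos, tau_pred=0.01, tau_cal=0.05,
                 kappa=50.0, alpha=10.0, conf_thresh=0.6, max_shift=0.3):
        self.anchor = [normalize(t) for t in text_protos]  # frozen text prototypes
        self.num_classes = len(self.anchor)
        dim = len(self.anchor[0])
        self.feat_sum = [[0.0] * dim for _ in self.anchor]  # soft feature sums
        self.counts = [0.0] * self.num_classes              # soft class counts
        self.tau_pred, self.tau_cal = tau_pred, tau_cal
        self.kappa, self.alpha = kappa, alpha
        self.conf_thresh, self.max_shift = conf_thresh, max_shift

    def prototypes(self):
        """Anchored mean per class, with the shift from the anchor norm-clipped."""
        out = []
        for a, s, n in zip(self.anchor, self.feat_sum, self.counts):
            raw = [(self.kappa * ai + si) / (self.kappa + n) for ai, si in zip(a, s)]
            shift = [r - ai for r, ai in zip(raw, a)]
            norm = math.sqrt(sum(d * d for d in shift))
            scale = min(1.0, self.max_shift / norm) if norm > 0 else 1.0
            out.append(normalize([ai + scale * d for ai, d in zip(a, shift)]))
        return out

    def priors(self):
        """Posterior mean of a symmetric Dirichlet given the soft counts."""
        total = self.alpha * self.num_classes + sum(self.counts)
        return [(self.alpha + n) / total for n in self.counts]

    def probs(self, feat, tau):
        """Class posterior from cosine logits at temperature tau plus log-priors."""
        x = normalize(feat)
        logits = [sum(p * xi for p, xi in zip(proto, x)) / tau + math.log(pi)
                  for proto, pi in zip(self.prototypes(), self.priors())]
        return softmax(logits)

    def step(self, feat):
        """Predict one sample, then update the head only if it is confident."""
        pred = self.probs(feat, self.tau_pred)  # sharp: used for the label
        cal = self.probs(feat, self.tau_cal)    # softer: reported confidence
        if max(pred) >= self.conf_thresh:       # selective evidence
            x = normalize(feat)
            for c, r in enumerate(pred):
                self.counts[c] += r
                self.feat_sum[c] = [s + r * xi for s, xi in zip(self.feat_sum[c], x)]
        return pred.index(max(pred)), cal

# Toy usage: three classes in a 4-d embedding space (made-up vectors).
head = LogitHeadSketch([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]])
label, confidence = head.step([1.0, 0.1, 0.0, 0.0])
print(label, head.priors())
```

Each step costs a handful of dot products per class and stores only per-class sums and counts, which is consistent with the single-pass, backprop-free setting described above. A sample whose top probability falls below `conf_thresh` leaves the state untouched.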
