Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models

Khanh-Binh Nguyen, Phuoc-Nguyen Bui, Hyunseung Choo, Duc Thanh Nguyen

TL;DR

This work tackles the performance degradation of vision-language models (VLMs) under distribution shifts by introducing Adaptive Cache Enhancement (ACE), a test-time adaptation framework that builds a robust, class-aware cache. ACE employs class-wise adaptive thresholds that are initialized from zero-shot statistics and refined online via exponential moving averages and exploration, yielding flexible, per-class decision boundaries. The method combines zero-shot CLIP predictions with a cache-based residual learning objective, including an unsupervised entropy loss and a prototype-alignment loss, to improve robustness across 15 datasets and under natural distribution shifts. Empirical results show that ACE achieves state-of-the-art robustness and cross-dataset generalization while maintaining practical computation time, illustrating its potential for real-world deployment of VLMs in dynamic environments.
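To make the idea of combining zero-shot predictions with a cache term and scoring the result by entropy more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the additive fusion rule, the `alpha` weight, and the function names `fuse_logits`, `softmax`, and `entropy` are illustrative assumptions, and the toy logits are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def fuse_logits(zero_shot_logits, cache_logits, alpha=1.0):
    """Assumed fusion rule: zero-shot logits plus a weighted cache term.

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    return [z + alpha * c for z, c in zip(zero_shot_logits, cache_logits)]

# Toy three-class example (values are invented for illustration).
zero_shot = [2.0, 1.5, 0.1]   # zero-shot logits for one test image
cache = [1.0, -0.5, 0.0]      # hypothetical cache-based logits
fused = fuse_logits(zero_shot, cache)

h_before = entropy(softmax(zero_shot))
h_after = entropy(softmax(fused))
```

In this toy case the cache term agrees with the top zero-shot class, so the fused prediction has lower entropy than the zero-shot one. An entropy loss of the kind mentioned above would push predictions in that direction during adaptation.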

Abstract

Vision-language models (VLMs) exhibit remarkable zero-shot generalization but suffer performance degradation under distribution shifts in downstream tasks, particularly in the absence of labeled data. Test-Time Adaptation (TTA) addresses this challenge by enabling online optimization of VLMs during inference, eliminating the need for annotated data. Cache-based TTA methods exploit historical knowledge by maintaining a dynamic memory cache of low-entropy or high-confidence samples, promoting efficient adaptation to out-of-distribution data. Nevertheless, these methods face two critical challenges: (1) unreliable confidence metrics under significant distribution shifts, resulting in error accumulation within the cache and degraded adaptation performance; and (2) rigid decision boundaries that fail to accommodate substantial distributional variations, leading to suboptimal predictions. To overcome these limitations, we introduce the Adaptive Cache Enhancement (ACE) framework, which constructs a robust cache by selectively storing high-confidence or low-entropy image embeddings per class, guided by dynamic, class-specific thresholds initialized from zero-shot statistics and iteratively refined using an exponential moving average and exploration-augmented updates. This approach enables adaptive, class-wise decision boundaries, ensuring robust and accurate predictions across diverse visual distributions. Extensive experiments on 15 diverse benchmark datasets demonstrate that ACE achieves state-of-the-art performance, delivering superior robustness and generalization compared to existing TTA methods in challenging out-of-distribution scenarios.
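The class-wise thresholding described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the class name `ClassWiseThresholdCache`, the `momentum`, `explore_margin`, and `capacity` parameters, and the specific acceptance rule are all hypothetical. In particular, the fixed margin below is only a stand-in for the "exploration-augmented updates", whose exact form is not given in this section.

```python
class ClassWiseThresholdCache:
    """Per-class confidence thresholds refined by an exponential moving average."""

    def __init__(self, init_thresholds, momentum=0.9, explore_margin=0.05, capacity=3):
        # init_thresholds: one starting value per class, e.g. derived from
        # zero-shot confidence statistics (how they are derived is assumed).
        self.thresholds = list(init_thresholds)
        self.momentum = momentum              # hypothetical EMA momentum
        self.explore_margin = explore_margin  # hypothetical exploration slack
        self.capacity = capacity              # max cached items per class
        self.cache = {c: [] for c in range(len(init_thresholds))}

    def update(self, embedding, probs):
        """Process one test sample; return (predicted class, accepted flag)."""
        pred = max(range(len(probs)), key=probs.__getitem__)
        conf = probs[pred]
        # Accept if confidence clears the current class threshold minus the margin.
        accepted = conf >= self.thresholds[pred] - self.explore_margin
        if accepted:
            slot = self.cache[pred]
            slot.append((conf, embedding))
            # Keep only the most confident entries for this class.
            slot.sort(key=lambda item: item[0], reverse=True)
            del slot[self.capacity:]
        # EMA refinement of the threshold for the predicted class only.
        m = self.momentum
        self.thresholds[pred] = m * self.thresholds[pred] + (1.0 - m) * conf
        return pred, accepted

# Toy usage with two classes and invented probabilities.
cache = ClassWiseThresholdCache(init_thresholds=[0.6, 0.8])
first = cache.update("emb_a", [0.7, 0.3])   # class 0, 0.7 >= 0.6 - 0.05
second = cache.update("emb_b", [0.3, 0.7])  # class 1, 0.7 <  0.8 - 0.05
```

Because each class keeps its own threshold, a class on which the model is systematically less confident can still fill its cache slots, whereas a single global threshold would tend to starve it.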

Paper Structure

This paper contains 37 sections, 13 equations, 4 figures, and 10 tables.

Figures (4)

  • Figure 1: Noisy CLIP predictions on different views of a "dog". CLIP can be overconfident on the simple views and fail on others.
  • Figure 2: Overview of the ACE method. We introduce the Class-wise Adaptive Threshold Enhancement module, which refines the thresholds online so that more correct but low-confidence samples can be cached. By simultaneously minimizing the entropy loss for the prototype residuals and building highly reliable caches, ACE enhances overall multimodal generalization and robustness.
  • Figure 3: t-SNE [van2008visualizing] visualizations of the image features stored in the cache, comparing DPE [zhang2024dual] and ACE-Entropy.
  • Figure 4: Comparison of cache accuracy vs. test accuracy on four datasets for TDA [tda], DPE [zhang2024dual], and our ACE-Entropy.