BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping

Taolin Zhang, Jinpeng Wang, Hang Guo, Tao Dai, Bin Chen, Shu-Tao Xia

TL;DR

This paper breaks down the design of popular training-required and training-free test-time adaptation methods and bridges the gap between them with a light-weight key-value memory that retrieves features from both instance-agnostic historical samples and instance-aware boosting samples.

Abstract

Adaptation of pretrained vision-language models such as CLIP to various downstream tasks has attracted great interest in recent research. Previous works have proposed a variety of test-time adaptation (TTA) methods to achieve strong generalization without any knowledge of the target domain. However, existing training-required TTA approaches like TPT rely on entropy minimization, which incurs a large computational overhead, while training-free methods like TDA overlook the potential for mining information from the test samples themselves. In this paper, we break down the design of existing popular training-required and training-free TTA methods and bridge the gap between them within our framework. Specifically, we maintain a light-weight key-value memory for feature retrieval from instance-agnostic historical samples and instance-aware boosting samples. The historical samples are filtered from the testing data stream and serve to extract useful information from the target distribution, while the boosting samples are drawn from regional bootstrapping and capture the knowledge of the test sample itself. We theoretically justify the rationale behind our method and empirically verify its effectiveness on both out-of-distribution and cross-domain datasets, showcasing its applicability in real-world situations.
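
The key-value memory described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes L2-normalized features, a Tip-Adapter-style affinity `exp(-beta * (1 - cosine))`, a fixed entropy threshold for filtering, and precomputed features for the augmented views. The names `HistoryCache`, `cache_logits`, `predict`, `alpha`, `beta`, `scale`, and `ent_threshold` are all hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

class HistoryCache:
    """Hypothetical per-class store that keeps the lowest-entropy features."""

    def __init__(self, num_classes, shots):
        self.num_classes = num_classes
        self.shots = shots
        self.slots = {c: [] for c in range(num_classes)}

    def add(self, feature, label, ent):
        slot = self.slots[label]
        slot.append((ent, feature))
        slot.sort(key=lambda item: item[0])  # most confident first
        del slot[self.shots:]                # evict beyond capacity

    def entries(self):
        # (key, value) pairs: feature and its pseudo-label
        return [(f, c) for c, slot in self.slots.items() for _, f in slot]

def cache_logits(feature, entries, num_classes, alpha=1.0, beta=5.0):
    # Assumed affinity form: each key votes for its pseudo-label,
    # weighted by exp(-beta * (1 - cosine similarity)).
    logits = np.zeros(num_classes)
    for key, label in entries:
        logits[label] += alpha * np.exp(-beta * (1.0 - key @ feature))
    return logits

def predict(test_feat, view_feats, text_feats, history,
            ent_threshold=0.5, scale=100.0):
    num_classes = text_feats.shape[0]
    zero_shot = scale * (text_feats @ test_feat)

    # Instance-agnostic part: historical samples from earlier test inputs.
    entries = history.entries()

    # Instance-aware part: low-entropy augmented views of this very sample.
    for view in view_feats:
        p = softmax(scale * (text_feats @ view))
        if entropy(p) < ent_threshold:
            entries.append((view, int(p.argmax())))

    logits = zero_shot + cache_logits(test_feat, entries, num_classes)

    # The test sample itself joins the history only if it is confident.
    p = softmax(zero_shot)
    if entropy(p) < ent_threshold:
        history.add(test_feat, int(p.argmax()), entropy(p))
    return logits
```

In this sketch the boosting entries live only for the current prediction, while the historical entries persist across the test stream. Whether the paper merges both kinds of sample into a single capacity-limited cache is not stated in the text above, so that choice is an assumption here.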

Paper Structure

This paper contains 25 sections, 5 theorems, 51 equations, 6 figures, 19 tables.

Key Result

Proposition 1

(Informal) Given $n$ samples $\{(x_i, y_i)\}_{i=1}^{n}$ with a frozen encoder $g$ that effectively performs feature clustering with respect to labels, the gradient descent optimization direction of the classifier's weights based on cross-entropy generally tends towards making predictions using the

Figures (6)

  • Figure 1: (a) Existing training-required TTA methods use a self-supervised objective such as entropy minimization for better generalization. (b) Existing training-free TTA methods perform feature retrieval on the historical samples to adjust the model prediction. (c) Performance comparison on the Out-of-Distribution benchmark and Cross-Datasets benchmark.
  • Figure 2: Connection between cross-entropy optimization and the cache classifier over well-clustered samples with a frozen feature encoder. Under cross-entropy optimization, samples pull the classifier weights of their own class closer while pushing the weights of other classes away. Since the feature space is well-clustered, the classifier weights ultimately converge near the feature centers of the samples. As a result, the optimal classifier obtained through cross-entropy minimization behaves similarly to the cache classifier.
  • Figure 3: Overall architecture of BoostAdapter. BoostAdapter leverages knowledge from the target domain and employs self-bootstrapping with historical and boosting samples in the boosting cache, respectively.
  • Figure 4: Ablation studies of (a) number of augmented views to generate boosting samples (b) different adaptation methods and (c) total shot capacity of the cache.
  • Figure 5: Qualitative results. The model predictions are provided below the images. Boosting samples with low entropy improve information extraction from the test sample and help the model distinguish between classes better.
  • ...and 1 more figure
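
The pull-and-push behavior described in the Figure 2 caption can be made concrete with the standard softmax cross-entropy gradient. The setup below is an assumption for illustration and not the paper's exact formulation: a linear classifier with one weight vector $w_c$ per class, logits $w_c^\top g(x_i)$, no bias, no temperature, and step size $\eta$.

$$p_{i,c} = \frac{\exp\big(w_c^\top g(x_i)\big)}{\sum_{k}\exp\big(w_k^\top g(x_i)\big)}, \qquad \mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\log p_{i,y_i}$$

$$\nabla_{w_c}\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\big(p_{i,c} - \mathbb{1}[y_i = c]\big)\, g(x_i)$$

$$w_c \leftarrow w_c + \frac{\eta}{n}\Big[\sum_{i:\, y_i = c}\big(1 - p_{i,c}\big)\, g(x_i) \;-\; \sum_{i:\, y_i \neq c} p_{i,c}\, g(x_i)\Big]$$

The first sum moves $w_c$ toward the features of its own class, and the second moves it away from features of other classes in proportion to how strongly they are currently confused with class $c$. When the features of each class are tightly clustered, the first sum points approximately along that class's feature center, which is the sense in which the trained classifier resembles a cache of class features.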

Theorems & Definitions (9)

  • Definition 1
  • Definition 2
  • Definition 3
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Definition 4
  • Proposition 4
  • Proposition 5