Online Gaussian Test-Time Adaptation of Vision-Language Models

Clément Fuchs, Maxime Zanella, Christophe De Vleeschouwer

TL;DR

This work studies online test-time adaptation (OTTA) of vision-language models and introduces Online Gaussian Adaptation (OGA). OGA models class-conditional visual features with Gaussian distributions and combines these likelihoods with zero-shot priors through a Maximum A Posteriori (MAP) rule whose hyperparameters are fixed across all datasets. It maintains a compact cache, updates class means and a shared covariance online, and selects low-entropy samples from the stream. Across a broad set of datasets, OGA achieves strong average gains and favorable tail performance, and it can be applied on top of few-shot methods, which shows the practical benefit of combining OTTA with offline adaptation. The authors also argue for more rigorous OTTA evaluation: they highlight run-to-run variability and propose Expected Tail Accuracy (ETA), the average accuracy over the worst 10% of runs, as a more informative metric for practical deployments.
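To make the idea of combining Gaussian likelihoods with zero-shot priors concrete, here is a minimal sketch of a MAP-style prediction. It is not the authors' implementation: the function name `map_predict`, the use of a softmax over zero-shot logits as the prior, and the unweighted sum of log-prior and log-likelihood are all assumptions made for illustration. The paper's exact rule and its fixed hyperparameters may differ.

```python
import numpy as np


def map_predict(features, zero_shot_logits, means, shared_cov):
    """Illustrative MAP posterior (hypothetical helper, not the paper's code).

    features:         (n, d) visual features
    zero_shot_logits: (n, k) zero-shot scores, turned into a prior via softmax
    means:            (k, d) per-class Gaussian means
    shared_cov:       (d, d) covariance shared by all classes
    """
    precision = np.linalg.inv(shared_cov)
    diff = features[:, None, :] - means[None, :, :]  # (n, k, d)
    # Squared Mahalanobis distance of every sample to every class mean.
    maha = np.einsum("nkd,de,nke->nk", diff, precision, diff)
    # With a shared covariance, the normalizing constant is identical for
    # all classes and cancels in the posterior, so it is omitted here.
    log_lik = -0.5 * maha
    # Assumption: the prior is the softmax of the zero-shot logits.
    shifted = zero_shot_logits - zero_shot_logits.max(axis=1, keepdims=True)
    log_prior = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Posterior is proportional to prior times likelihood.
    log_post = log_prior + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

In this sketch, a feature close to one class mean is assigned to that class even under a uniform prior, while a feature equidistant from all means is decided by the zero-shot prior alone.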

Abstract

Online test-time adaptation (OTTA) of vision-language models (VLMs) has recently garnered increased attention to take advantage of data observed along a stream to improve future predictions. Unfortunately, existing methods rely on dataset-specific hyperparameters, significantly limiting their adaptability to unseen tasks. In response, we propose Online Gaussian Adaptation (OGA), a novel method that models the likelihoods of visual features using Gaussian distributions and incorporates zero-shot priors into an interpretable Maximum A Posteriori (MAP) estimation framework with fixed hyperparameters across all datasets. We demonstrate that OGA outperforms state-of-the-art methods on most datasets and runs. Additionally, we show that combining OTTA with popular few-shot techniques (a practical yet overlooked setting in prior research) is highly beneficial. Furthermore, our experimental study reveals that common OTTA evaluation protocols, which average performance over at most three runs per dataset, are inadequate due to the substantial variability observed across runs for all OTTA methods. Therefore, we advocate for more rigorous evaluation practices, including increasing the number of runs and considering additional quantitative metrics, such as our proposed Expected Tail Accuracy (ETA), calculated as the average accuracy in the worst 10% of runs. We hope these contributions will encourage more rigorous and diverse evaluation practices in the OTTA community. Code is available at https://github.com/cfuchs2023/OGA.
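The ETA definition above (average accuracy over the worst 10% of runs) can be sketched in a few lines. The function name `expected_tail_accuracy` and the handling of run counts that are not a multiple of ten (rounding, with at least one run kept) are assumptions for illustration; the abstract does not specify them.

```python
def expected_tail_accuracy(run_accuracies, tail_fraction=0.10):
    """Average accuracy over the worst `tail_fraction` of runs.

    Assumption: the tail size is round(tail_fraction * n), with a minimum
    of one run. The source only states "the worst 10% of runs".
    """
    if not run_accuracies:
        raise ValueError("run_accuracies must contain at least one run")
    n_tail = max(1, round(tail_fraction * len(run_accuracies)))
    # Sort ascending so the lowest-accuracy runs come first.
    worst_runs = sorted(run_accuracies)[:n_tail]
    return sum(worst_runs) / n_tail
```

With 100 runs, as used for the paper's figures, this averages the 10 lowest accuracies. With only three runs, the tail collapses to the single worst run, which illustrates why a small number of runs gives little information about tail behavior.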

Paper Structure

This paper contains 41 sections, 11 equations, 3 figures, and 12 tables.

Figures (3)

  • Figure 1: The presented results are averaged over 100 runs. We propose the Expected Tail Accuracy (ETA), i.e., the average over the 10% worst runs, shown as a solid red line. Our method, OGA, not only significantly outperforms competitors on average but also has an ETA exceeding their average accuracy on several datasets (e.g., ImageNet and Pets). See Table \ref{tab:zero_shot_ACC_and_ETL} for more detailed results.
  • Figure 2: For each dataset, we show the percentage of runs in which our method OGA achieves a higher accuracy than the competing methods DMN and TDA. The experimental setting is the same as the one used to generate the results of Table \ref{tab:zero_shot_ACC_and_ETL}.
  • Figure 3: We show the dynamics of the accuracy of our OGA method as it starts from an empty cache, averaged over 100 runs. At regular intervals, we evaluate the accuracy of OGA on the complete test set.