Test-Time Adaptation Induces Stronger Accuracy and Agreement-on-the-Line

Eungyeup Kim; Mingjie Sun; Christina Baek; Aditi Raghunathan; J. Zico Kolter

TL;DR

Recent test-time adaptation (TTA) methods not only improve out-of-distribution (OOD) performance but also drastically strengthen the accuracy-on-the-line (ACL) and agreement-on-the-line (AGL) trends across models, even under shifts where those correlations were previously very weak.

Abstract

Recently, Miller et al. (2021) and Baek et al. (2022) empirically demonstrated strong linear correlations between in-distribution (ID) and out-of-distribution (OOD) accuracy, and likewise between ID and OOD agreement. These trends, coined accuracy-on-the-line (ACL) and agreement-on-the-line (AGL), enable OOD model selection and performance estimation without labeled data. However, these phenomena break down for certain shifts, such as CIFAR10-C Gaussian Noise, posing a critical bottleneck. In this paper, we make a key finding: recent test-time adaptation (TTA) methods not only improve OOD performance but also drastically strengthen the ACL and AGL trends in models, even under shifts where models previously showed very weak correlations. To analyze this, we revisit the theoretical conditions from Miller et al. (2021) that characterize the types of distribution shift needed for perfect ACL in linear models. Surprisingly, after TTA is applied to deep models, these conditions are satisfied in the penultimate feature embedding space. In particular, TTA collapses complex distribution shifts into ones that can be expressed by a single scaling variable in the feature space. Our results show that by combining TTA with AGL-based estimation methods, we can estimate the OOD performance of models with high precision for a broader set of distribution shifts. This yields a simple system for selecting the best hyperparameters and adaptation strategy without any labeled OOD data.
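As a rough illustration of how an AGL-based estimate can work, the sketch below fits a line to probit-scaled ID versus OOD agreement (which needs no labels) and applies that same line to probit-scaled ID accuracy. It is a simplified sketch in the spirit of the AGL-based estimators the abstract mentions, not the paper's exact procedure; the function names (`probit`, `agreement`, `fit_line`, `estimate_ood_accuracy`) and the clipping constant are assumptions made for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the probit transform


def probit(p, eps=1e-6):
    """Inverse standard-normal CDF, with p clipped away from 0 and 1."""
    return STD_NORMAL.inv_cdf(min(max(p, eps), 1.0 - eps))


def agreement(preds_a, preds_b):
    """Fraction of inputs on which two models predict the same label."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + bias."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_ood_accuracy(id_agreements, ood_agreements, id_accuracies):
    """Estimate OOD accuracy from unlabeled agreement plus labeled ID accuracy.

    Assumption: the linear trend fitted on agreement (one point per model
    pair) is shared by accuracy (one point per model).
    """
    slope, bias = fit_line(
        [probit(a) for a in id_agreements],
        [probit(a) for a in ood_agreements],
    )
    return [STD_NORMAL.cdf(slope * probit(acc) + bias) for acc in id_accuracies]
```

In this sketch, `id_agreements` and `ood_agreements` would be computed with `agreement` over pairs of models on ID and OOD inputs, while `id_accuracies` comes from labeled ID validation data. The estimate is only as good as the linear fit, which is why stronger AGL after adaptation matters.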

Paper Structure (31 sections, 1 theorem, 6 equations, 14 figures, 7 tables, 2 algorithms)

Introduction
Related Work
Understanding accuracy and agreement-on-the-line.
Adaptations under distribution shifts and their pitfalls of reliability.
Adaptations lead to stronger Agreement-on-the-Line
Experimental setup
Calculating agreement.
Main observation
Why do adaptations lead to strong linear trends?
Theoretical conditions for linear trends in Gaussian data
Empirical analysis under adaptation to CIFAR10-C
Experiments
OOD Accuracy estimation after adaptation
Unsupervised validation for TTA
Ablation Study: When does TTA not improve linear trends?
...and 16 more sections

Key Result

Theorem 1

(Miller et al., 2021, simplified) Under the Gaussian data setup in Equation eq:gaussian_data, across all linear classifiers $f_\theta: x \mapsto \textup{sign}(\theta^\top x)$, the probit-scaled accuracies over $P$ and $Q$ exhibit a perfect linear correlation with a bias of zero and a slope of $\frac{\alpha}
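The Gaussian setup that the theorem refers to is not reproduced in this excerpt, so the sketch below assumes a hypothetical one to show the kind of statement being made: labels are $\pm 1$, ID inputs follow $x \mid y \sim \mathcal{N}(y\mu, \sigma^2 I)$, and the OOD shift rescales the class mean by `mean_scale` and the noise standard deviation by `std_scale`. These parameter names and the setup itself are assumptions for illustration and are not taken from the paper. Under them, the accuracy of any linear classifier has a closed form, and the probit-scaled ID and OOD accuracies fall on a single line through the origin.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def id_ood_accuracy(theta, mu, sigma, mean_scale, std_scale):
    """Closed-form ID and OOD accuracy of sign(theta^T x).

    Assumed (hypothetical) setup, with y in {-1, +1}:
      ID:  x | y ~ N(y * mu, sigma^2 I)
      OOD: x | y ~ N(mean_scale * y * mu, (std_scale * sigma)^2 I)
    """
    norm = math.sqrt(sum(t * t for t in theta))
    margin = sum(t * m for t, m in zip(theta, mu)) / norm
    acc_id = STD_NORMAL.cdf(margin / sigma)
    acc_ood = STD_NORMAL.cdf(mean_scale * margin / (std_scale * sigma))
    return acc_id, acc_ood


rng = random.Random(0)
mu = [0.6, 0.8, 0.0]  # unit-norm class mean, chosen arbitrarily
points = []  # (probit ID accuracy, probit OOD accuracy) per random classifier
for _ in range(20):
    theta = [rng.gauss(0.0, 1.0) for _ in mu]
    acc_id, acc_ood = id_ood_accuracy(
        theta, mu, sigma=1.0, mean_scale=0.5, std_scale=2.0
    )
    points.append((STD_NORMAL.inv_cdf(acc_id), STD_NORMAL.inv_cdf(acc_ood)))
```

In this assumed setup every point in `points` satisfies `y = (mean_scale / std_scale) * x`, so the fitted line has zero bias. A shift that cannot be summarized by such a scaling would not, in general, give a perfect line.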

Figures (14)

Figure 1: Linear trends in both accuracy and agreement hold to a substantially stronger degree after applying adaptation methods than before. Blue and pink dots denote accuracy and agreement, respectively, each shown with its linear fit; $R^2$ is the coefficient of determination of the fit.
Figure 2: Strong AGL under adaptation with varying hyperparameters, including learning rates, adaptation steps, batch sizes, and early-stopped checkpoints of the ID-trained model. Blue and pink dots denote accuracy and agreement, respectively, each shown with its linear fit.
Figure 3: 2-D visualizations of each TTA baseline's ground-truth best OOD accuracy (x-axis) and the estimated OOD accuracy of the best-ID models (y-axis), marked with a cross ($\times$). Each color denotes a different TTA baseline. Circles ($\circ$) represent estimates and ground-truth accuracies averaged over all hyperparameter values.
Figure 4: Linear trends for shifts that already exhibit strong AGL in Vanilla. Notice that adaptations do not break the linear trends, even when they lead to accuracy drops. We test SHOT, TENT, and ETA on shifts such as CIFAR10.1, ImageNetV2, ImageNet-R, and FMoW-WILDS. Blue and pink dots denote accuracy and agreement, respectively, each shown with its linear fit. The axes are probit scaled.
Figure 5: OOD accuracy estimation results with a limited amount of ID data, decreasing from $50\%$ to $1\%$ of the entire data pool. We randomly sample $10$ different subsets for each ratio and visualize the distribution of MAE (%) results. Results for baselines including ATC, DoC-feat, and Agreement are shown for comparison.
...and 9 more figures

Theorems & Definitions (1)

Theorem 1
