NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering

Alexander Murphy, Michal Danilowski, Soumyajit Chatterjee, Abhirup Ghosh

TL;DR

NEO addresses distribution shift in vision transformers by re-centering test-time embeddings at the origin using a global centroid estimate, an approach grounded in neural-collapse geometry. It is hyperparameter-free, optimization-free, and incurs negligible add-on compute, achieving higher accuracy and better calibration across multiple datasets (ImageNet-C, CIFAR-10-C, ImageNet-R, ImageNet-Sketch) and ViT sizes, including strong edge-device performance. A simple replacement of the final Linear layer with a lightweight NEO mechanism enables adaptation from as few as 1 sample or 1 class, with a continual variant for evolving shifts. Together, these results advance practical, resource-efficient TTA and provide insight into latent-space structure under domain shift.

Abstract

Test-Time Adaptation (TTA) methods are often computationally expensive, require a large amount of data for effective adaptation, or are brittle to hyperparameters. Building on a theoretical account of the geometry of the latent space, we significantly improve the alignment between source and distribution-shifted samples by re-centering target data embeddings at the origin. This insight motivates NEO, a hyperparameter-free fully TTA method that adds no significant compute compared to vanilla inference. NEO improves the classification accuracy of ViT-Base on ImageNet-C from 55.6% to 59.2% after adapting on just one batch of 64 samples. When adapting on 512 samples, NEO beats all 7 TTA methods we compare against on ImageNet-C, ImageNet-R and ImageNet-S and beats 6/7 on CIFAR-10-C, while using the least compute. NEO performs well on model calibration metrics and can additionally adapt from 1 class to improve accuracy on the 999 other classes in ImageNet-C. On Raspberry Pi and Jetson Orin Nano devices, NEO reduces inference time by 63% and memory usage by 9% compared to baselines. Our results, based on 3 ViT architectures and 4 datasets, show that NEO can be used efficiently and effectively for TTA.
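
The re-centering idea in the abstract can be illustrated in a few lines. What follows is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the class name `RecenteredLinear`, the running-mean update rule, and the toy dimensions are all hypothetical. It keeps a running estimate of the global centroid of target embeddings and subtracts that estimate before applying a frozen linear classifier, so no gradients or tunable hyperparameters are involved.

```python
import numpy as np


class RecenteredLinear:
    """Hypothetical head: subtract a running centroid, then apply a frozen linear layer."""

    def __init__(self, weight, bias):
        self.weight = np.asarray(weight, dtype=np.float64)  # (num_classes, dim), frozen
        self.bias = np.asarray(bias, dtype=np.float64)      # (num_classes,), frozen
        self.centroid = np.zeros(self.weight.shape[1])      # running mean of target embeddings
        self.count = 0                                      # embeddings folded in so far

    def update(self, embeddings):
        """Fold a batch of embeddings with shape (batch, dim) into the running mean."""
        embeddings = np.asarray(embeddings, dtype=np.float64)
        n = embeddings.shape[0]
        total = self.count + n
        # Incremental mean: new = old + (batch_sum - n * old) / total
        self.centroid += (embeddings.sum(axis=0) - n * self.centroid) / total
        self.count = total

    def __call__(self, embeddings, adapt=True):
        embeddings = np.asarray(embeddings, dtype=np.float64)
        if adapt:
            self.update(embeddings)
        # Re-center at the origin, then classify with the unchanged weights.
        return (embeddings - self.centroid) @ self.weight.T + self.bias


# Toy check (assumption: the shift is one shared offset added to centered embeddings).
rng = np.random.default_rng(0)
dim, num_classes = 8, 3
head = RecenteredLinear(rng.normal(size=(num_classes, dim)), np.zeros(num_classes))
clean = rng.normal(size=(64, dim))
clean -= clean.mean(axis=0)          # centered "source" embeddings
shift = np.full(dim, 5.0)            # shared offset standing in for the domain shift
logits = head(clean + shift)         # adapts on the batch, then classifies
```

In this toy setting the estimated centroid equals the injected offset, so the logits match those of the clean embeddings. Real distribution shifts are not a pure translation, so this shows only the mechanism, not the accuracy gains reported above.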

Paper Structure

This paper contains 44 sections, 4 theorems, 9 equations, 35 figures, 4 tables, 2 algorithms.

Key Result

Proposition 4.1

Consider a network $f$ with a linear classifier. Assume the model is trained to neural collapse with cross-entropy loss, weight regularization and uniformly distributed classes. Then given ${\bm{w}}_c$, the classifier weight vector corresponding to class $c$, we have

Figures (35)

  • Figure 1: Elegant adoption: NEO can be added by replacing the nn.Linear with our custom layer.
  • Figure 2: (a) Given a domain shifted sample, $\tilde{{\bm{x}}}$, we encode it to $h(\tilde{{\bm{x}}})$ and shift it using a single shared vector $\Delta$. The shifted representation is closer to the embedding of the corresponding clean sample (unknown), $h({\bm{x}})$, resulting in more accurate predictions. (b) Runtime (x axis), accuracy (y axis), and memory usage (point radius) of TTA methods for ViT-Base on 15 corruptions from ImageNet-C, evaluated on 512 samples per corruption. NEO outperforms all methods in terms of runtime, accuracy, and memory.
  • Figure 3: (a) Cumulative frequency of highest magnitude dimension in $h({\bm{x}}) - h(\tilde{{\bm{x}}})$ over 50000 samples (showing 250 out of 768 dimensions). A small number of dimensions account for the largest magnitude of the difference between source and corrupted embeddings. (b) Cosine similarities and difference of L2 norms between source embeddings and (adjusted) corrupted embeddings (i.e. the first row contains the average of $\cos(h({\bm{x}}), h(\tilde{{\bm{x}}}))$ and the average of $|\|h({\bm{x}})\| - \|h(\tilde{{\bm{x}}})\||$). Embeddings are taken from ImageNet-C severity level 5 Gaussian Noise, on ViT-Base model (pre-trained on ImageNet). Values are averaged over 50000 samples.
  • Figure 4: (a) Accuracy increase (%) and (b) ECE change compared to no-adaptation for ViT-S, ViT-B and ViT-L on ImageNet-C, CIFAR-10-C, ImageNet-Sketch and ImageNet-Rendition. Accuracy is computed over the whole dataset; where no confidence interval is shown, the 95% confidence interval is less than 0.05 for accuracy and less than 0.005 for ECE. (c) ECE scores for ViT-S on ImageNet-C averaged over the whole dataset, 15 corruptions and multiple runs.
  • Figure 5: (a) Accuracy (%) for ViT-B on ImageNet-C for varying numbers of adaptation samples. (b) Accuracy (%) for ViT-B on ImageNet-C for varying numbers of adaptation classes (50 samples used to adapt in total). Accuracy is calculated on samples not used for adaptation except for 50,000 samples. (c) Accuracy increase (%) for continual adaptation, adapting on 15 randomly ordered corruptions from ImageNet-C with 512 samples from each.
  • ...and 30 more figures
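
Figure 4 reports calibration as Expected Calibration Error (ECE). For readers unfamiliar with the metric, here is a minimal NumPy sketch of the commonly used equal-width-bin ECE. The function name `expected_calibration_error`, the default of 15 bins, and the demo values are assumptions for illustration; the binning used in the paper may differ.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=15):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    probs = np.asarray(probs, dtype=np.float64)    # (num_samples, num_classes)
    labels = np.asarray(labels)                    # (num_samples,)
    confidences = probs.max(axis=1)                # top-1 predicted probability
    predictions = probs.argmax(axis=1)             # top-1 predicted class
    correct = (predictions == labels).astype(np.float64)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap             # weight by fraction of samples in bin
    return float(ece)


# Demo (hypothetical values): four predictions at 0.8 confidence, three of them correct.
demo_probs = [[0.8, 0.2]] * 4
demo_labels = [0, 0, 0, 1]
demo_ece = expected_calibration_error(demo_probs, demo_labels)

# Fully confident and fully correct predictions are perfectly calibrated.
perfect_ece = expected_calibration_error([[1.0, 0.0], [0.0, 1.0]], [0, 1])
```

A lower ECE means predicted confidence tracks empirical accuracy more closely, which is the sense in which the figure compares methods on calibration.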

Theorems & Definitions (7)

  • Proposition 4.1
  • Proposition 4.2
  • proof
  • Proposition A.1
  • proof
  • Proposition
  • proof