ZOTTA: Test-Time Adaptation with Gradient-Free Zeroth-Order Optimization

Ronghao Zhang, Shuaicheng Niu, Qi Deng, Yanjie Dong, Jian Chen, Runhao Zeng

Abstract

Test-time adaptation (TTA) aims to improve model robustness under distribution shifts by adapting to unlabeled test data. Most existing methods rely on backpropagation (BP), which is computationally costly and incompatible with non-differentiable models such as quantized models, limiting practical deployment on edge devices. Recent BP-free approaches reduce this overhead but remain either architecture-specific or too limited in optimization capacity to handle high-dimensional models. We propose ZOTTA, a fully BP-free TTA framework that performs efficient adaptation using only forward passes via Zeroth-Order Optimization (ZOO). While ZOO is theoretically appealing, naive application leads to slow convergence in high-dimensional parameter spaces and to unstable optimization due to the lack of labels. ZOTTA overcomes these challenges through 1) Distribution-Robust Layer Selection, which automatically identifies and freezes layers that already extract distribution-invariant features, updating only domain-sensitive layers to reduce the optimization dimensionality and accelerate convergence; and 2) Spatial Feature Aggregation Alignment, which stabilizes ZOO by aligning globally aggregated spatial features between source and target to reduce gradient variance. Together, these components enable architecture-agnostic and stable BP-free adaptation. Extensive experiments on ImageNet-C/R/Sketch/A show that ZOTTA matches or outperforms BP-based methods; for example, it reduces memory usage by 84% and improves accuracy by 3.9% over SAR on ImageNet-C.
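
The forward-only update described in the abstract can be illustrated with a generic two-sided zeroth-order gradient estimator. The following is a minimal sketch, not the authors' implementation: the function names (`zo_gradient`, `toy_loss`), the perturbation scale `mu`, the number of queries, the learning rate, and the toy quadratic loss are all assumptions chosen for illustration. It only shows how a descent direction can be obtained from loss differences of perturbed forward evaluations, without backpropagation.

```python
import numpy as np


def zo_gradient(loss_fn, theta, num_queries=8, mu=1e-3, rng=None):
    """Estimate the gradient of loss_fn at theta from forward evaluations only.

    Each query draws a Gaussian direction u and uses the two-sided difference
    (loss(theta + mu*u) - loss(theta - mu*u)) / (2*mu) as the slope along u.
    All hyperparameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(num_queries):
        u = rng.standard_normal(theta.shape)
        delta = loss_fn(theta + mu * u) - loss_fn(theta - mu * u)
        grad += (delta / (2.0 * mu)) * u
    return grad / num_queries


def toy_loss(theta):
    """Hypothetical stand-in for an unsupervised TTA objective."""
    return float(np.sum((theta - 1.0) ** 2))


theta = np.zeros(4)
rng = np.random.default_rng(0)
initial_loss = toy_loss(theta)
for _ in range(200):
    # Plain gradient descent using the zeroth-order estimate.
    theta = theta - 0.05 * zo_gradient(toy_loss, theta, rng=rng)
final_loss = toy_loss(theta)
```

Each estimate costs two forward evaluations per query, and its variance grows with the number of perturbed parameters. That is the motivation the abstract gives for updating only a subset of layers.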

Paper Structure (33 sections, 6 equations, 13 figures, 14 tables, 1 algorithm)

Figures (13)

  • Figure 1: Comparison of Naive ZOO vs. Ours with varying numbers of forward passes using ViT-Base on ImageNet-C (Gaussian noise, severity level 5).
  • Figure 2: Analysis of ZOO gradient quality compared to the First-Order (FO) gradient on ImageNet-C, showing (Left) their respective gradient norms and (Right) the cosine similarity between the ZOO gradients (Naive ZOO, Ours) and the FO gradient.
  • Figure 3: An overview of the proposed ZOTTA framework, which enables BP-free TTA through two main components: 1) Distribution-Robust Layer Selection: Before TTA, we identify distribution-invariant layers and freeze them, updating only distribution-sensitive layers to reduce the optimization dimensionality. 2) Zeroth-Order Gradient Estimation via Spatial Feature Aggregation Alignment: For each TTA step, we inject Gaussian perturbations into the selected parameters, aggregate spatial/token features into global descriptors, and align their statistics with the source domain. This alignment objective provides a more stable zeroth-order gradient obtained from the loss difference of two-sided perturbed forward passes.
  • Figure 4: Comparison of our ZOTTA vs. FOA under different numbers of forward passes on ImageNet-C (average accuracy over severity-level-5 corruptions) using different models.
  • Figure 5: Per-layer clustering purity of ViT-Base.
  • ...and 8 more figures
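
The alignment objective described in the Figure 3 caption (aggregate spatial/token features into global descriptors, then align their statistics with the source domain) can be sketched as follows. This is a hedged illustration only: the excerpt does not specify which statistics or which distance are used, so the choice of per-channel mean and standard deviation, the squared-error distance, and the names `aggregate_spatial` and `alignment_loss` are assumptions.

```python
import numpy as np


def aggregate_spatial(features):
    """Average over the token/spatial axis.

    features: array of shape (batch, tokens, channels)
    returns:  array of shape (batch, channels), one global descriptor per sample
    """
    return features.mean(axis=1)


def alignment_loss(features, source_mean, source_std):
    """Squared distance between target and source descriptor statistics.

    Assumption: statistics are the per-channel mean and standard deviation
    of the pooled descriptors over the batch.
    """
    pooled = aggregate_spatial(features)
    target_mean = pooled.mean(axis=0)
    target_std = pooled.std(axis=0)
    mean_term = np.sum((target_mean - source_mean) ** 2)
    std_term = np.sum((target_std - source_std) ** 2)
    return float(mean_term + std_term)


rng = np.random.default_rng(0)
source_features = rng.standard_normal((16, 10, 8))  # synthetic "source" batch
pooled_source = aggregate_spatial(source_features)
source_mean = pooled_source.mean(axis=0)
source_std = pooled_source.std(axis=0)

# A constant offset of 0.5 per channel mimics a simple distribution shift.
shifted_features = source_features + 0.5
in_domain_loss = alignment_loss(source_features, source_mean, source_std)
shifted_loss = alignment_loss(shifted_features, source_mean, source_std)
```

Because the loss is a scalar computed from forward activations alone, it can serve as the objective evaluated under perturbed parameters in a zeroth-order update.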