FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation

Xingyu Wang, Tao Wang

TL;DR

To ensure efficient and stable adaptation over an out-of-distribution data stream, FOZO introduces a dynamically decaying perturbation scale during zeroth-order gradient estimation, and the authors theoretically prove its convergence under the TTA data stream assumption.

Abstract

Test-Time Adaptation (TTA) is essential for enabling deep learning models to handle real-world data distribution shifts. However, current approaches face significant limitations: backpropagation-based methods are ill-suited to low-end deployment devices because of their high computation and memory requirements and their tendency to modify model weights during adaptation, while traditional backpropagation-free techniques have limited adaptation capability. In this work, we propose Forward-Only Zeroth-Order Optimization (FOZO), a novel and practical backpropagation-free paradigm for TTA. FOZO uses memory-efficient zeroth-order prompt optimization, guided by objectives that optimize both intermediate feature statistics and prediction entropy. To ensure efficient and stable adaptation over the out-of-distribution data stream, we introduce a dynamically decaying perturbation scale during zeroth-order gradient estimation and theoretically prove its convergence under the TTA data stream assumption. Extensive continual adaptation experiments on ImageNet-C, ImageNet-R, and ImageNet-Sketch demonstrate FOZO's superior performance: it achieves 59.52% Top-1 accuracy on ImageNet-C (5K, level 5), outperforming the main gradient-based methods and the SOTA forward-only method FOA (58.13%). Furthermore, FOZO generalizes well to quantized (INT8) models. These findings demonstrate that FOZO is a highly competitive solution for TTA deployment in resource-limited scenarios.
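
As a rough illustration of the zeroth-order step described in the abstract, the sketch below estimates a prompt gradient from two loss evaluations and shrinks the perturbation scale over time. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the names `zo_gradient`, `decayed_eps`, and `adapt`, the inverse-time decay schedule, the plain gradient-descent update, and all hyperparameter values are hypothetical, and `loss_fn` stands in for a forward pass of the model.

```python
import numpy as np

def zo_gradient(loss_fn, prompt, eps, seed):
    """Two-point zeroth-order gradient estimate for `prompt`.

    Only loss values are used, so no backpropagation is needed.
    """
    # The perturbation is regenerated from the seed, so it does not
    # have to be stored between the two evaluations.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(prompt.shape)
    loss_plus = loss_fn(prompt + eps * z)   # forward pass at P + eps * Z
    loss_minus = loss_fn(prompt - eps * z)  # forward pass at P - eps * Z
    return (loss_plus - loss_minus) / (2.0 * eps) * z

def decayed_eps(eps0, step, decay=0.01):
    """Hypothetical inverse-time decay; the source does not give the schedule."""
    return eps0 / (1.0 + decay * step)

def adapt(loss_fn, prompt, steps=300, lr=0.02, eps0=0.1):
    """Run `steps` forward-only updates of the prompt (assumed SGD-style rule)."""
    for t in range(steps):
        eps_t = decayed_eps(eps0, t)
        grad = zo_gradient(loss_fn, prompt, eps_t, seed=t)
        prompt = prompt - lr * grad
    return prompt
```

On a toy quadratic loss, calling `adapt(loss_fn, np.zeros(8))` moves the prompt toward the minimizer using only loss evaluations, which is the property that makes this style of update attractive when backpropagation is unavailable.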

Paper Structure (59 sections, 38 equations, 4 figures, 9 tables, 1 algorithm)

Figures (4)

  • Figure 1: Convergence Curves of Forward-Only Test-Time Adaptation Algorithms. Average ACC@1 (%) of various Test-Time Adaptation (TTA) methods on ImageNet-C (level 5) versus adaptation time (s). Per their original settings, FOA and ZOA use 28 forward propagations (FPs), while our FOZO method employs 26 FPs. FOZO (Dynamic) consistently surpasses FOZO (Base) in accuracy, demonstrating the effectiveness of our dynamic perturbation strategy. Furthermore, FOZO achieves superior performance and faster convergence than FOA and ZOA. Notably, FOZO reaches 65% ACC@1 in only 66% of the runtime required by FOA and ZOA, respectively.
  • Figure 2: Overview of our proposed FOZO. (a) Forward-Only Zeroth-Order Optimization. This diagram illustrates how FOZO adaptively updates learnable visual prompts ($\mathbf{P}$) at test time using a zeroth-order gradient estimation. For each test batch, the model performs two forward passes by perturbing the prompt $\mathbf{P}$ in positive ($\mathbf{P} + \epsilon_t \mathbf{Z}$) and negative ($\mathbf{P} - \epsilon_t \mathbf{Z}$) directions. Here, $\mathbf{Z}$ is a perturbation vector generated from a random seed $s$, and $\epsilon_t$ is a dynamically adjusted perturbation step size. (b) Deep-Shallow Aligning Loss Function. We compute the mean ($\mu^T$) and standard deviation ($\sigma^T$) of the [CLS] token activations by grouping shallow ($e_1^0, \ldots, e_{N/2}^0$) and deep ($e_{N/2+1}^0, \ldots, e_N^0$) layers of the model, and align them with pre-computed source domain statistics ($\mu^S, \sigma^S$).
  • Figure 3: Ablation Study of FOZO Hyperparameters on ImageNet-C (Gaussian Noise, level 5).
  • Figure E.1: Mixed shift: Performance comparison on ImageNet-C (5K, level 5).
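
The deep-shallow alignment objective from the Figure 2(b) caption, together with the prediction-entropy term mentioned in the abstract, can be sketched as follows. This is an illustrative NumPy sketch, not the paper's exact loss: pooling all layers of a group before taking statistics, the L2 distance between statistics, and the names `group_stats`, `alignment_loss`, and `prediction_entropy` are assumptions. `cls_tokens` is assumed to have shape (layers, batch, dim).

```python
import numpy as np

def group_stats(cls_tokens):
    """Per-dimension mean and std over all layers and samples in a group."""
    flat = cls_tokens.reshape(-1, cls_tokens.shape[-1])
    return flat.mean(axis=0), flat.std(axis=0)

def alignment_loss(cls_tokens, source_stats):
    """Distance between target and source [CLS] statistics.

    cls_tokens: array of shape (layers, batch, dim).
    source_stats: dict with keys "shallow" and "deep", each a (mu, sigma) pair.
    """
    half = cls_tokens.shape[0] // 2
    # Layers 1..N/2 form the shallow group, layers N/2+1..N the deep group.
    groups = {"shallow": cls_tokens[:half], "deep": cls_tokens[half:]}
    loss = 0.0
    for name, tokens in groups.items():
        mu_t, sigma_t = group_stats(tokens)
        mu_s, sigma_s = source_stats[name]
        # Assumed L2 distance; the source does not state the metric.
        loss += np.linalg.norm(mu_t - mu_s) + np.linalg.norm(sigma_t - sigma_s)
    return float(loss)

def prediction_entropy(logits):
    """Mean Shannon entropy of the softmax predictions, computed stably."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(np.exp(log_probs) * log_probs).sum(axis=-1).mean())
```

Both functions need only activations and logits from a forward pass, so a scalar combination of them could serve as the loss queried by a zeroth-order optimizer. How the two terms are weighted is not stated in this section.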