Model Evolution Under Zeroth-Order Optimization: A Neural Tangent Kernel Perspective

Chen Zhang, Yuxin Cheng, Chenchen Ding, Shuqi Wang, Jingreng Lei, Runsheng Yu, Yik-Chung WU, Ngai Wong

Abstract

Zeroth-order (ZO) optimization enables memory-efficient training of neural networks by estimating gradients via forward passes only, eliminating the need for backpropagation. However, the stochastic nature of gradient estimation significantly obscures the training dynamics, in contrast to the well-characterized behavior of first-order methods under Neural Tangent Kernel (NTK) theory. To address this, we introduce the Neural Zeroth-order Kernel (NZK) to describe model evolution in function space under ZO updates. For linear models, we prove that the expected NZK remains constant throughout training and depends explicitly on the first and second moments of the random perturbation directions. This invariance yields a closed-form expression for model evolution under squared loss. We further extend the analysis to linearized neural networks. Interpreting ZO updates as kernel gradient descent via NZK provides a novel perspective for potentially accelerating convergence. Extensive experiments across synthetic and real-world datasets (including MNIST, CIFAR-10, and Tiny ImageNet) validate our theoretical results and demonstrate acceleration when using a single shared random vector.
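As a rough illustration of the forward-pass-only gradient estimation described in the abstract, the sketch below trains a linear model under squared loss with a generic two-point ZO estimator. It is not the paper's exact estimator or its NZK formulation: the function names (`zo_gradient`, `train_linear_zo`), the zero-mean unit-variance Gaussian directions, the smoothing radius `eps`, and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np

def zo_gradient(loss_fn, theta, rng, eps=1e-3, num_samples=1):
    """Two-point ZO gradient estimate built from loss evaluations only.

    Assumption: directions z ~ N(0, I). The paper allows a general mean and
    variance for z; this sketch fixes them to 0 and 1.
    """
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        z = rng.standard_normal(theta.shape)
        # Central difference of the loss along direction z (two forward passes).
        delta = loss_fn(theta + eps * z) - loss_fn(theta - eps * z)
        grad += (delta / (2.0 * eps)) * z
    return grad / num_samples

def train_linear_zo(X, y, steps=2000, lr=0.01, seed=0):
    """Train f(x; theta) = x @ theta under squared loss with ZO updates."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    loss_fn = lambda th: 0.5 * np.mean((X @ th - y) ** 2)
    losses = [loss_fn(theta)]
    for _ in range(steps):
        theta = theta - lr * zo_gradient(loss_fn, theta, rng)
        losses.append(loss_fn(theta))
    return theta, losses

# Hypothetical 2-D fitting task with noiseless targets.
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((64, 2))
true_theta = np.array([1.5, -0.5])
y = X @ true_theta

theta, losses = train_linear_zo(X, y)
print("initial loss:", losses[0])
print("final loss:", losses[-1])
print("learned theta:", theta)
```

No backpropagation is involved: each update uses two loss evaluations and one random direction. Changing the distribution of `z` changes how the estimate is scaled and oriented in expectation, which is the kind of dependence on the moments of the perturbation directions that the abstract attributes to the expected NZK.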

Paper Structure (22 sections, 4 theorems, 73 equations, 18 figures)

This paper contains 22 sections, 4 theorems, 73 equations, 18 figures.

Key Result

Theorem 1

Suppose a linear model $f(\bm{x};\theta)$ is trained using ZO optimization with random direction vectors $\bm{z}\sim\mathcal{N}(\mu_{\bm{z}}\bm{1},\sigma_{\bm{z}}^2\bm{I}_d)$, and the rate of change of $f(\bm{x};\theta)$ w.r.t. $\theta$ is estimated following Eq. eq:parest with $\bm{\zeta}\sim\mathc

Figures (18)

  • Figure 1: Comparison of losses in 2-D tasks under FO and ZO with varying $\sigma_{\bm{z}}$ for sampling. "Independent" indicates that $\bm{z}$ and $\bm{\zeta}$ are sampled independently, while "Identical" denotes that $\bm{\zeta}$ is set equal to $\bm{z}$.
  • Figure 2: Comparison of losses on CIFAR-10 and Tiny ImageNet using the linearized neural network under FO, traditional parametric gradient ZO ("ZO (parametric)"), and kernel gradient ZO with identical $\bm{\zeta}$ and $\bm{z}$ ("ZO (kernel)").
  • Figure 3: Comparison of NZK for linearized neural networks in Tiny ImageNet classification using FO (left) and ZO (right), with identical $\bm{\zeta}$ and $\bm{z}$.
  • Figure 4: Comparison of the final linear models trained with FO and ZO after 16,000 iterations, with different $\sigma_{\bm{z}}$. The vectors $\bm{\zeta}$ and $\bm{z}$ are sampled independently.
  • Figure 5: Evolution of linear models under FO and ZO in a 2-D fitting task. The vectors $\bm{\zeta}$ and $\bm{z}$ are independently sampled. (a) Evolution under FO. (b-d) Evolution under ZO with (b) $\sigma_{\bm{z}}=1$, (c) $\sigma_{\bm{z}}=0.5$, and (d) $\sigma_{\bm{z}}=1.5$.
  • ...and 13 more figures

Theorems & Definitions (5)

  • Theorem 1
  • Corollary 2
  • Proposition 3
  • proof
  • Lemma 6