Hierarchical Zero-Order Optimization for Deep Neural Networks

Sansheng Cao; Zhengyu Ma; Yonghong Tian

TL;DR

This paper tackles the inefficiency of zeroth-order optimization in deep networks by introducing Hierarchical Zeroth-Order (HZO) optimization, which bisects the network along its depth and applies recursive Jacobian-target propagation to produce updates that match backpropagation (BP) in direction. The key contributions are a proven reduction of query complexity from $O(ML^2)$ to $O(ML \log L)$, a detailed error analysis showing stability near the unitary Lipschitz limit $L_{lip} \approx 1$, and empirical validation on CIFAR-10 and a 10-class ImageNet subset demonstrating competitive accuracy and scalability without full backpropagation. Theoretical results include Theorem 1 (Gradient Equivalence) and a recurrence-based complexity proof, along with an examination of error accumulation (Theorem 3) and the unitary-limit condition. Practically, HZO allows biologically plausible zeroth-order learning to scale to deep architectures, with spatial parallel perturbation further reducing cost for convolutional layers, which makes non-differentiable or hardware-restricted training more feasible at ImageNet scale.

Abstract

Zeroth-order (ZO) optimization has long been favored for its biological plausibility and its capacity to handle non-differentiable objectives, yet its computational complexity has historically limited its application in deep neural networks. Challenging the conventional paradigm that gradients propagate layer-by-layer, we propose Hierarchical Zeroth-Order (HZO) optimization, a novel divide-and-conquer strategy that decomposes the depth dimension of the network. We prove that HZO reduces the query complexity from $O(ML^2)$ to $O(ML \log L)$ for a network of width $M$ and depth $L$, representing a significant leap over existing ZO methodologies. Furthermore, we provide a detailed error analysis showing that HZO maintains numerical stability by operating near the unitary limit ($L_{lip} \approx 1$). Extensive evaluations on CIFAR-10 and ImageNet demonstrate that HZO achieves competitive accuracy compared to backpropagation.
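
As a hedged illustration of how a depth-bisection scheme can yield the stated $O(ML \log L)$ bound, consider the following sketch. The recurrence form and the constant $c$ are assumptions made for illustration; the paper's own recurrence-based proof may be set up differently. Let $T(L)$ be the number of queries needed for a block of depth $L$ and width $M$, and suppose each bisection step costs $cM L$ queries to propagate targets across the split before recursing on both halves:

$$
T(L) = 2\,T\!\left(\tfrac{L}{2}\right) + cML, \qquad T(1) = O(M).
$$

Unrolling over the $\log_2 L$ levels of recursion, each level contributes $cML$ in total, so

$$
T(L) = cML \log_2 L + L \cdot T(1) = O(ML \log L),
$$

compared with the $O(ML^2)$ cost quoted for the non-hierarchical baseline.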

Paper Structure (37 sections, 28 equations, 5 figures, 1 table, 1 algorithm)

Introduction
Related Work
ZO in Fine-tuning
ZO Computation Complexity
ZO and Biologically Plausible Algorithms
Method
Problem Description
Hierarchical Zeroth-Order Optimizer (HZO)
Formal Definition of Subnetworks
Recursive Bisection and Target Propagation
Local Learning Rule and Gradient Equivalence
Algorithm
Spatial Parallel Perturbation for Convolutional Layers
Theoretical Analysis
Equivalence to Backpropagation
...and 22 more sections

Figures (5)

Figure 1: Gradient Cosine Similarity vs. Network Depth. We compare the cosine similarity of gradients estimated by HZO and Standard ZO across various depths ($L \in \{16, 32, 64\}$). HZO combined with ResNet maintains near-perfect directional alignment ($\rho \approx 1.0$) even at $L=64$, whereas the cosine similarity of Plain CNNs remains low.
Figure 2: Experimental Results on the CIFAR-10 Dataset. (Top) Training loss (red) and training accuracy (blue). (Middle) Validation accuracy of the HZO method. (Bottom) Cosine similarity between HZO and BP gradients.
Figure 3: Experimental Results on the ImageNet Dataset. (Top) Training loss (red) and training accuracy (blue). (Middle) Validation accuracy of the HZO method. (Bottom) Cosine similarity between HZO and BP gradients.
Figure 4: Comparison of computational complexity
Figure 5: Experimental Results with different activation functions and numerical precision
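
The cosine-similarity metric reported in Figures 1 to 3 can be illustrated with a minimal sketch. This is a generic coordinate-wise zeroth-order estimator on a toy least-squares loss, not the HZO algorithm itself; the function names `zo_gradient` and `cosine_similarity`, the step size `eps`, and the toy problem are all assumptions chosen for illustration.

```python
import numpy as np

def zo_gradient(f, w, eps=1e-4):
    """Estimate grad f(w) by central differences, using 2 * w.size queries of f.

    Generic zeroth-order baseline (an illustrative assumption), not HZO.
    """
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        # Only function values are queried; no derivatives are used.
        grad[i] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad

def cosine_similarity(u, v):
    """Directional alignment between two gradient vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy least-squares problem (hypothetical data) with a known analytic gradient.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))
b = rng.normal(size=8)
w = rng.normal(size=5)

def loss(w):
    return 0.5 * np.sum((A @ w - b) ** 2)

true_grad = A.T @ (A @ w - b)    # plays the role of the BP gradient
est_grad = zo_gradient(loss, w)  # zeroth-order estimate
rho = cosine_similarity(est_grad, true_grad)
print(f"cosine similarity: {rho:.6f}")
```

On this small quadratic problem the two gradients should be almost perfectly aligned. The estimator needs two loss queries per parameter, which is the kind of cost that becomes prohibitive in deep networks and that motivates a hierarchical scheme.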