DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training

Aochuan Chen; Yimeng Zhang; Jinghan Jia; James Diffenderfer; Jiancheng Liu; Konstantinos Parasyris; Yihua Zhang; Zheng Zhang; Bhavya Kailkhura; Sijia Liu

TL;DR

DeepZero is a principled zeroth-order (ZO) deep learning (DL) framework that scales ZO optimization to DNN training from scratch through three primary innovations. Among them is a sparsity-induced ZO training protocol that extends model pruning, using only finite differences, to explore and exploit the sparse DL prior in coordinate-wise gradient estimation (CGE).

Abstract

Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems when first-order (FO) information is difficult or impossible to obtain. However, the scalability of ZO optimization remains an open problem: its use has primarily been limited to relatively small-scale ML problems, such as sample-wise adversarial attack generation. To the best of our knowledge, no prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance. To overcome this roadblock, we develop DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch through three primary innovations. First, we demonstrate the advantages of coordinate-wise gradient estimation (CGE) over randomized vector-wise gradient estimation in training accuracy and computational efficiency. Second, we propose a sparsity-induced ZO training protocol that extends the model pruning methodology using only finite differences to explore and exploit the sparse DL prior in CGE. Third, we develop the methods of feature reuse and forward parallelization to advance the practical implementations of ZO training. Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time. Furthermore, we show the practical utility of DeepZero in applications of certified adversarial defense and DL-based partial differential equation error correction, achieving 10-20% improvement over SOTA. We believe our results will inspire future research on scalable ZO optimization and contribute to advancing DL with black boxes. Code is available at https://github.com/OPTML-Group/DeepZero.
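
To make the two gradient estimators named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a forward finite-difference form with smoothing step `mu`, a flat 1-D parameter vector, and Gaussian perturbations for the randomized estimator; none of these details are stated in the abstract. The function names `cge` and `rge` and the optional `mask` argument (a stand-in for a pruning-derived sparsity pattern) are hypothetical.

```python
import numpy as np

def cge(loss, theta, mu=1e-3, mask=None):
    """Coordinate-wise gradient estimate from function values only.

    Assumed form: forward differences, one extra query per coordinate.
    If `mask` is given, only coordinates where it is True are queried,
    and the remaining gradient entries stay zero (a sparse estimate).
    """
    theta = np.asarray(theta, dtype=float)
    base = loss(theta)                      # one query shared by all coordinates
    grad = np.zeros_like(theta)
    coords = range(theta.size) if mask is None else np.flatnonzero(mask)
    for i in coords:
        step = np.zeros_like(theta)
        step[i] = mu                        # perturb a single coordinate
        grad[i] = (loss(theta + step) - base) / mu
    return grad

def rge(loss, theta, num_queries=10, mu=1e-3, rng=None):
    """Randomized vector-wise gradient estimate (assumed Gaussian directions)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta, dtype=float)
    base = loss(theta)
    grad = np.zeros_like(theta)
    for _ in range(num_queries):
        u = rng.standard_normal(theta.shape)   # random perturbation direction
        grad += (loss(theta + mu * u) - base) / mu * u
    return grad / num_queries

def zo_sgd(loss, theta0, lr=0.1, steps=100, mask=None):
    """Plain gradient descent driven by the CGE estimate (no backpropagation)."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        theta -= lr * cge(loss, theta, mask=mask)
    return theta

def quadratic(theta):
    """Toy objective whose true gradient is theta itself."""
    return 0.5 * float(np.sum(theta ** 2))

theta0 = np.array([1.0, -2.0, 3.0])
g_dense = cge(quadratic, theta0)
g_sparse = cge(quadratic, theta0, mask=np.array([True, False, True]))
g_random = rge(quadratic, theta0, num_queries=20, rng=np.random.default_rng(0))
theta_final = zo_sgd(quadratic, theta0)
```

In this sketch the dense CGE call costs one loss query per parameter plus one shared baseline query, while the masked call skips every pruned coordinate. That is the sense in which a sparsity pattern reduces query cost, under the assumptions above.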

Paper Structure (22 sections, 11 equations, 14 figures, 5 tables, 2 algorithms)

Introduction
Related Work
ZO Optimization through Function Value-based Gradient Estimation: Randomized or Coordinate-wise?
Sparsity-Assisted ZO Training: A Pruning Lens and Beyond
Improving Scalability: Feature Reuse & Forward Parallel
Experiments
Image classification task
Other black-box applications
Conclusion
Remark on convergence rate.
The Simple CNN Architecture Considered for Training w/ CGE vs. RGE
Computation Time Comparison between RGE and CGE
Performance of Model Pruning via ZO-GraSP
Algorithm Details
ZO-GraSP-oriented-LPR-guided ZO training
...and 7 more sections

Figures (14)

Figure 1: Overview of our DeepZero framework. A: ZO gradient estimation via model queries (see "ZO Optimization through Function Value-based Gradient Estimation: Randomized or Coordinate-wise?"). B: Model pruning guides gradient sparsity (see "Sparsity-Assisted ZO Training: A Pruning Lens and Beyond"). C: Acceleration by parallelization and feature reuse (see "Improving Scalability: Feature Reuse & Forward Parallel"). D: Comparison of DeepZero on CIFAR-10 with the computational-graph-free baseline Pattern Search [chiang2023loss] and with computational-graph-dependent methods without BP: Align-Ada [boopathy2022train], and LG-FG-A and FG-W [ren2022scaling].
Figure 2: Performance comparison of training a simple CNN with varying numbers of parameters on CIFAR-10 using different training methods.
Figure 3: Computation cost of CGE-based ZO training w/ feature reuse vs. w/o feature reuse. The setup follows the figure comparing the computation time of CGE and RGE.
Figure 4: Comparison between DeepZero and FO training baselines on a ResNet-20 for CIFAR-10. We report the mean and standard deviation of 3 independent runs for each experiment.
Figure 5: Comparison of DeepZero and Pattern Search on ResNet-20 for CIFAR-10 with varying dataset sizes. All experiments are done on a single NVIDIA A6000 GPU.
...and 9 more figures
