ElasticZO: A Memory-Efficient On-Device Learning with Combined Zeroth- and First-Order Optimization

Keisuke Sugiura; Hiroki Matsutani

TL;DR

This work tackles the memory bottleneck of training deep networks on edge devices by introducing ElasticZO, a hybrid method that trains most layers with zeroth-order (ZO) optimization and only the last few layers with backpropagation (BP) to improve accuracy. ElasticZO-INT8 extends the approach to 8-bit quantized networks, training with integer arithmetic only by computing quantized ZO gradients from integer cross-entropy loss values. Experiments on MNIST, Fashion-MNIST, and ModelNet40 show that ElasticZO narrows the accuracy gap between ZO and BP at a memory overhead of 0.072-1.7% over vanilla ZO, while ElasticZO-INT8 reduces memory usage by 1.46-1.60x and training time by 1.38-1.42x without compromising accuracy. The methods are shown to be viable for fine-tuning as well as full training on resource-constrained devices, highlighting the practical potential of ZO optimization in on-device learning.

Abstract

Zeroth-order (ZO) optimization is being recognized as a simple yet powerful alternative to standard backpropagation (BP)-based training. Notably, ZO optimization allows for training with only forward passes and (almost) the same memory as inference, making it well-suited for edge devices with limited computing and memory resources. In this paper, we propose ZO-based on-device learning (ODL) methods for full-precision and 8-bit quantized deep neural networks (DNNs), namely ElasticZO and ElasticZO-INT8. ElasticZO lies between pure ZO- and pure BP-based approaches, and is based on the idea of employing BP for the last few layers and ZO for the remaining layers. ElasticZO-INT8 achieves integer arithmetic-only ZO-based training for the first time, by incorporating a novel method for computing quantized ZO gradients from integer cross-entropy loss values. Experimental results on classification datasets show that ElasticZO effectively addresses the slow convergence of vanilla ZO and shrinks the accuracy gap to BP-based training. Compared to vanilla ZO, ElasticZO achieves 5.2-9.5% higher accuracy with only 0.072-1.7% memory overhead, and can handle fine-tuning tasks as well as full training. ElasticZO-INT8 further reduces the memory usage and training time by 1.46-1.60x and 1.38-1.42x, respectively, without compromising the accuracy. These results demonstrate a better tradeoff between accuracy and training cost compared to pure ZO- and BP-based approaches, and also highlight the potential of ZO optimization in on-device learning.
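The hybrid idea described in the abstract (ZO for the early layers, BP for the last few) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the two-layer network, the function names (`elastic_step`, `forward`, `softmax_xent`), the hyperparameters, and the synthetic data are hypothetical. The ZO part uses a two-point (SPSA-style) estimate with the perturbation regenerated from a seed, as in MeZO, and the sketch runs one extra unperturbed forward pass for the BP layer purely for clarity.

```python
import numpy as np

def softmax_xent(logits, y):
    """Mean cross-entropy and softmax probabilities for integer labels y."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponent
    e = np.exp(z)
    p = e / e.sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(len(y)), y] + 1e-12).mean()
    return loss, p

def forward(w1, w2, x):
    h = np.maximum(x @ w1, 0.0)  # first layer (trained with ZO) + ReLU
    return h, h @ w2             # last layer (trained with BP) gives logits

def loss_only(w1, w2, x, y):
    return softmax_xent(forward(w1, w2, x)[1], y)[0]

def elastic_step(w1, w2, x, y, seed, eps=1e-3, lr_zo=1e-3, lr_bp=1e-1):
    """One hybrid update, in place: two-point ZO for w1, exact gradient for w2."""
    # The perturbation is regenerated from the seed instead of being stored.
    z = np.random.default_rng(seed).standard_normal(w1.shape)
    w1 += eps * z
    loss_plus = loss_only(w1, w2, x, y)
    w1 -= 2.0 * eps * z
    loss_minus = loss_only(w1, w2, x, y)
    w1 += eps * z                                  # restore original weights
    g = (loss_plus - loss_minus) / (2.0 * eps)     # scalar projected gradient

    # BP for the last layer only (assumption: extra unperturbed forward pass).
    h, logits = forward(w1, w2, x)
    _, p = softmax_xent(logits, y)
    p[np.arange(len(y)), y] -= 1.0                 # dLoss/dlogits, times batch size
    grad_w2 = h.T @ p / len(y)

    w1 -= lr_zo * g * z                            # ZO update, no activations kept
    w2 -= lr_bp * grad_w2                          # first-order update

# Hypothetical toy problem: 3-class labels from a random linear map.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8))
y = (x @ rng.standard_normal((8, 3))).argmax(axis=1)
w1 = 0.5 * rng.standard_normal((8, 16))
w2 = np.zeros((16, 3))

loss_before = loss_only(w1, w2, x, y)
for step in range(300):
    elastic_step(w1, w2, x, y, seed=step)
loss_after = loss_only(w1, w2, x, y)
```

In this sketch only the input to the last layer has to be kept for the backward computation, while the ZO-trained layer needs nothing beyond the two scalar loss values, which is the intuition behind the small memory overhead reported above.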

Paper Structure (17 sections, 15 equations, 7 figures, 2 tables, 2 algorithms)

Introduction
Related Work
Preliminaries
Zeroth-Order Optimization
MeZO: Memory-Efficient Zeroth-Order Optimizer
Method
Memory-Efficiency of ElasticZO
INT8 Training with ElasticZO-INT8
Zeroth-order Gradient Estimation with Integer Arithmetic
Memory-Efficiency of ElasticZO-INT8
Evaluation
Experimental Setup
Details of DNN Models and Training
Accuracy on Classification Datasets
Memory-Efficiency
...and 2 more sections

Figures (7)

Figure 1: Overview of ElasticZO (top: LeNet-5, bottom: PointNet). ElasticZO trains the first $C$ layers with ZO optimization and the last $L - C$ layers with BP. Red and yellow rectangles denote layers with and without trainable parameters, respectively.
Figure 2: Training and test loss curves of LeNet-5 (FP32; left: MNIST, right: Fashion-MNIST).
Figure 3: Training and test loss curves of LeNet-5 (INT8; left: MNIST, right: Fashion-MNIST).
Figure 4: Memory usage breakdown of LeNet-5 (FP32; left: $B = 32$, right: $B = 256$).
Figure 5: Memory usage breakdown of LeNet-5 (INT8; left: $B = 32$, right: $B = 256$).
...and 2 more figures
