ElasticZO: A Memory-Efficient On-Device Learning with Combined Zeroth- and First-Order Optimization
Keisuke Sugiura, Hiroki Matsutani
TL;DR
This work tackles the memory bottleneck of training deep networks on edge devices by introducing ElasticZO, a hybrid method that trains the last few layers with backpropagation (BP) and all remaining layers with zeroth-order (ZO) optimization. It extends this approach to ElasticZO-INT8, which trains 8-bit quantized networks using integer arithmetic only by computing quantized ZO gradients from integer cross-entropy loss values. On MNIST, Fashion-MNIST, and ModelNet40, ElasticZO narrows the accuracy gap between ZO and BP, reaching 5.2-9.5% higher accuracy than vanilla ZO with only 0.072-1.7% memory overhead. ElasticZO-INT8 further cuts memory usage by 1.46-1.60x and training time by 1.38-1.42x without compromising accuracy. The methods are shown to be viable for both fine-tuning and full training on resource-constrained devices, highlighting the practical potential of ZO optimization in on-device learning.
Abstract
Zeroth-order (ZO) optimization is increasingly recognized as a simple yet powerful alternative to standard backpropagation (BP)-based training. Notably, ZO optimization allows for training with only forward passes and (almost) the same memory as inference, making it well-suited for edge devices with limited computing and memory resources. In this paper, we propose ZO-based on-device learning (ODL) methods for full-precision and 8-bit quantized deep neural networks (DNNs), namely ElasticZO and ElasticZO-INT8. ElasticZO sits between pure ZO- and pure BP-based approaches: it employs BP for the last few layers and ZO for the remaining layers. ElasticZO-INT8 achieves integer arithmetic-only ZO-based training for the first time by incorporating a novel method for computing quantized ZO gradients from integer cross-entropy loss values. Experimental results on classification datasets show that ElasticZO effectively addresses the slow convergence of vanilla ZO and shrinks the accuracy gap to BP-based training. Compared to vanilla ZO, ElasticZO achieves 5.2-9.5% higher accuracy with only 0.072-1.7% memory overhead, and can handle fine-tuning tasks as well as full training. ElasticZO-INT8 further reduces memory usage and training time by 1.46-1.60x and 1.38-1.42x, respectively, without compromising accuracy. These results demonstrate a better tradeoff between accuracy and training cost compared to pure ZO- and BP-based approaches, and also highlight the potential of ZO optimization in on-device learning.
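To make the hybrid idea concrete, the following is a minimal NumPy sketch of one training step in which the first layer is updated with a ZO estimate and the last layer with an exact gradient. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the ZO estimator, so a two-point random-perturbation estimate is assumed here, and the two-layer network, the toy data, the learning rates, and all function names (forward, hybrid_step, demo) are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Return the mean cross-entropy loss and the softmax probabilities."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return loss, probs

def forward(w1, w2, x):
    """Two-layer MLP: ReLU hidden layer followed by a linear classifier."""
    hidden = np.maximum(x @ w1, 0.0)
    return hidden, hidden @ w2

def hybrid_step(w1, w2, x, y, rng, lr_zo=0.01, lr_bp=0.1, eps=1e-3):
    """One update: ZO estimate for w1, exact (BP-style) gradient for w2."""
    # ZO part (assumed two-point estimator): perturb w1 along a random
    # direction z and compare two forward-pass losses.
    z = rng.standard_normal(w1.shape)
    loss_plus, _ = softmax_cross_entropy(forward(w1 + eps * z, w2, x)[1], y)
    loss_minus, _ = softmax_cross_entropy(forward(w1 - eps * z, w2, x)[1], y)
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)  # a scalar

    # BP part: analytic gradient of the loss w.r.t. the last layer only,
    # which needs just the activations feeding that layer.
    hidden, logits = forward(w1, w2, x)
    loss, probs = softmax_cross_entropy(logits, y)
    dlogits = probs.copy()
    dlogits[np.arange(len(y)), y] -= 1.0
    grad_w2 = hidden.T @ dlogits / len(y)

    w1_new = w1 - lr_zo * projected_grad * z
    w2_new = w2 - lr_bp * grad_w2
    return w1_new, w2_new, loss

def demo(steps=300, seed=0):
    """Train on toy 2-class data and return the per-step loss values."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((64, 8))
    y = (x[:, 0] > 0).astype(int)  # hypothetical toy labels
    w1 = 0.1 * rng.standard_normal((8, 16))
    w2 = 0.1 * rng.standard_normal((16, 2))
    losses = []
    for _ in range(steps):
        w1, w2, loss = hybrid_step(w1, w2, x, y, rng)
        losses.append(loss)
    return losses
```

Calling demo() returns the loss at each step. In this sketch the ZO-trained layer never stores intermediate activations for a backward pass; only the input to the last layer is kept, which is the kind of saving the abstract attributes to restricting BP to the final layers.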
