Poor Man's Training on MCUs: A Memory-Efficient Quantized Back-Propagation-Free Approach

Yequan Zhao, Hai Li, Ian Young, Zheng Zhang

TL;DR

This paper presents a BP-free training scheme for MCUs that makes edge training hardware design as easy as inference hardware design. It adopts a quantized zeroth-order method to estimate the gradients of quantized model parameters, which can overcome the error of the straight-through estimator in a low-precision BP scheme.

Abstract

Back propagation (BP) is the default solution for gradient computation in neural network training. However, implementing BP-based training on edge devices such as FPGAs, microcontrollers (MCUs), and analog computing platforms faces several major challenges, including limited hardware resources, long time-to-market, and dramatic errors in low-precision settings. This paper presents a simple BP-free training scheme on an MCU, which makes edge training hardware design as easy as inference hardware design. We adopt a quantized zeroth-order method to estimate the gradients of quantized model parameters, which can overcome the error of a straight-through estimator in a low-precision BP scheme. We further employ several dimension reduction methods (e.g., node perturbation, sparse training) to improve the convergence of zeroth-order training. Experimental results show that our BP-free training achieves performance comparable to BP-based training when adapting a pre-trained image classifier to various corrupted data on resource-constrained edge devices (e.g., an MCU with 1024-KB SRAM for dense full-model training, or an MCU with 256-KB SRAM for sparse training). This method is most suitable for application scenarios where memory cost and time-to-market are the major concerns and longer latency can be tolerated.
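
To make the idea of BP-free training concrete, the following is a minimal sketch of zeroth-order SGD on a toy linear regression problem. It is not the paper's implementation: it runs in floating point rather than on quantized parameters, and the names `rge`, `zo_sgd`, `mu`, `q`, and `lr` are assumptions chosen for illustration. The point it shows is that the gradient estimate is built from forward loss evaluations only, so no computation graph or stored activations are needed.

```python
import random

def loss(theta, data):
    """Mean squared error of the linear model y = theta[0] * x + theta[1]."""
    return sum((theta[0] * x + theta[1] - y) ** 2 for x, y in data) / len(data)

def rge(f, theta, mu=1e-3, q=8, rng=random):
    """Randomized gradient estimate built from q + 1 forward evaluations of f.

    Hypothetical helper: mu is the perturbation scale, q the number of
    random directions. No backward pass is performed.
    """
    base = f(theta)
    grad = [0.0] * len(theta)
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in theta]           # random direction
        shifted = [t + mu * ui for t, ui in zip(theta, u)]
        scale = (f(shifted) - base) / (mu * q)             # finite difference
        grad = [g + scale * ui for g, ui in zip(grad, u)]
    return grad

def zo_sgd(f, theta, lr=0.05, steps=500, **kw):
    """Plain SGD update in which the gradient is replaced by the estimate."""
    for _ in range(steps):
        g = rge(f, theta, **kw)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

# Toy data generated from y = 2x - 1 (an assumption made for this example).
rng = random.Random(0)
data = [(k / 10.0, 2.0 * (k / 10.0) - 1.0) for k in range(-10, 11)]
theta = zo_sgd(lambda th: loss(th, data), [0.0, 0.0], rng=rng)
print("estimated slope and intercept:", theta)
```

Each training step costs `q + 1` forward passes, which reflects the trade-off stated in the abstract: lower memory and simpler hardware in exchange for longer latency.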

Paper Structure

This paper contains 32 sections, 16 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: (a) Overview of the BP-free training framework. A quantized inference engine is easily converted to a training engine by adding a control unit and repeatedly calling the inference accelerator. (b) Training memory comparison of different training methods. The numbers are measured with MCUNet-in1 (Lin et al., 2020), batch size 1, and resolution $128 \times 128$.
  • Figure 2: Comparison between first-order (FO) optimization and ZO optimization. FO optimization converges faster because it uses the exact gradient from BP to update model parameters. ZO optimization, on the other hand, uses only forward function queries to estimate the gradients. The ZO method converges more slowly due to the large variance of its gradient estimate, but it is much more memory-efficient, since no extra computation graph is needed.
  • Figure 3: (a) Data flow of weight perturbation. (b) Data flow of node perturbation.
  • Figure 4: Top: The dimension of weights/nodes at each layer. Bottom: The cosine similarity between zeroth-order gradient estimation and the first-order gradient computed by back-propagation at each layer.
  • Figure 6: Overview of BP-free training framework on MCU.
  • ...and 1 more figure
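
The captions of Figures 3 and 4 contrast weight perturbation with node perturbation and compare zeroth-order estimates with the BP gradient by cosine similarity. The sketch below illustrates that comparison for a single linear layer with a squared-error loss. It uses a common textbook formulation of node perturbation, not necessarily the paper's data flow; the layer sizes, `mu`, `q`, and all function names are assumptions made for illustration.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def loss_from_z(z, y):
    """Squared-error loss on the layer output z."""
    return 0.5 * sum((zi - yi) ** 2 for zi, yi in zip(z, y))

def weight_perturbation(W, x, y, mu, q, rng):
    """Perturb every weight: each random direction has n_out * n_in entries."""
    n_out, n_in = len(W), len(W[0])
    base = loss_from_z(matvec(W, x), y)
    G = [[0.0] * n_in for _ in range(n_out)]
    for _ in range(q):
        U = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
        Wp = [[w + mu * u for w, u in zip(rw, ru)] for rw, ru in zip(W, U)]
        s = (loss_from_z(matvec(Wp, x), y) - base) / (mu * q)
        for i in range(n_out):
            for j in range(n_in):
                G[i][j] += s * U[i][j]
    return G

def node_perturbation(W, x, y, mu, q, rng):
    """Perturb the layer output z = Wx: each direction has only n_out entries."""
    z = matvec(W, x)
    base = loss_from_z(z, y)
    gz = [0.0] * len(z)
    for _ in range(q):
        u = [rng.gauss(0, 1) for _ in z]
        zp = [zi + mu * ui for zi, ui in zip(z, u)]
        s = (loss_from_z(zp, y) - base) / (mu * q)
        gz = [g + s * ui for g, ui in zip(gz, u)]
    # Chain rule for a linear layer: dL/dW = (dL/dz) x^T.
    return [[g * xj for xj in x] for g in gz]

def cosine(A, B):
    a = [v for row in A for v in row]
    b = [v for row in B for v in row]
    dot = sum(p * r for p, r in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(r * r for r in b)))

rng = random.Random(0)
n_out, n_in, q, mu, trials = 4, 64, 16, 1e-3, 20
weight_cos = node_cos = 0.0
for _ in range(trials):
    W = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
    x = [rng.gauss(0, 1) for _ in range(n_in)]
    y = [rng.gauss(0, 1) for _ in range(n_out)]
    z = matvec(W, x)
    true_grad = [[(zi - yi) * xj for xj in x] for zi, yi in zip(z, y)]  # exact gradient
    weight_cos += cosine(weight_perturbation(W, x, y, mu, q, rng), true_grad) / trials
    node_cos += cosine(node_perturbation(W, x, y, mu, q, rng), true_grad) / trials
print(f"mean cosine, weight perturbation: {weight_cos:.2f}")
print(f"mean cosine, node perturbation:   {node_cos:.2f}")
```

With the same number of queries, the node-space estimate lives in a much smaller space (here 4 dimensions instead of 256), which is the sense in which node perturbation acts as a dimension reduction method.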

Theorems & Definitions (2)

  • Definition 2.1: Randomized Gradient Estimator, RGE
  • Definition 2.2: ZO-SGD
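
Only the names of the two definitions survive in this listing. For orientation, the forms most commonly used in the zeroth-order optimization literature are given below. The paper's own notation, perturbation distribution, and quantized variant may differ, and the symbols $\mu$ (perturbation scale), $q$ (number of queries), $u_i$ (random directions), and $\eta$ (learning rate) are assumptions introduced here.

$$\hat{\nabla} f(\theta) = \frac{1}{q} \sum_{i=1}^{q} \frac{f(\theta + \mu u_i) - f(\theta)}{\mu}\, u_i, \qquad u_i \sim \mathcal{N}(0, I)$$

$$\theta_{t+1} = \theta_t - \eta\, \hat{\nabla} f(\theta_t)$$

The first expression estimates the gradient from $q + 1$ forward evaluations of $f$; the second is the usual SGD update with the exact gradient replaced by that estimate.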