On-Device Training of Fully Quantized Deep Neural Networks on Cortex-M Microcontrollers

Mark Deutel, Frank Hannig, Christopher Mutschler, Jürgen Teich

TL;DR

This work presents a method that enables efficient training of DNNs completely in place on the MCU using fully quantized training (FQT) and dynamic partial gradient updates and provides insights into the tradeoff between training accuracy, memory overhead, energy, and latency on real hardware.

Abstract

On-device training of DNNs allows models to adapt and fine-tune to newly collected data or changing domains while deployed on microcontroller units (MCUs). However, DNN training is a resource-intensive task, making the implementation and execution of DNN training algorithms on MCUs challenging due to low processor speeds, constrained throughput, limited floating-point support, and memory constraints. In this work, we explore on-device training of DNNs for Cortex-M MCUs. We present a method that enables efficient training of DNNs completely in place on the MCU using fully quantized training (FQT) and dynamic partial gradient updates. We demonstrate the feasibility of our approach on multiple vision and time-series datasets and provide insights into the tradeoff between training accuracy, memory overhead, energy, and latency on real hardware.

Paper Structure (14 sections, 7 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Data dependencies between forward and backward pass of linear and convolutional layers in a DNN.
  • Figure 2: Schematic comparison of a convolutional block with ReLU activation and Batchnorm for Quantization Aware Training (QAT) and Fully Quantized Training (FQT). In FQT, the convolutional, Batchnorm, and ReLU layers have been folded into a single monolithic QConv layer.
  • Figure 3: Heatmaps of the absolute values of the gradient tensors of the last three linear layers of a DNN trained on the flowers dataset, shown for one example training sample after the first epoch (left column) and after the tenth epoch (right column) of training.
  • Figure 4: Results of fully quantized on-device transfer learning (blue) compared to floating-point (green) and mixed training (orange). The accuracy results are averaged over five training runs; baseline results trained on a GPU-based server are shown in red. The latency results were measured on the IMXRT2062 MCU and are averaged over 1000 consecutive training steps. In both the accuracy and latency plots, black bars denote the standard deviation. The two memory plots show the Flash and RAM utilization for all datasets as returned by our deployment framework. Red dashed lines mark the memory constraints of the different MCUs where relevant; see also the table of MCUs.
  • Figure 5: Comparison of latency and energy for transfer learning on the CWRU and Daliac datasets for the three MCU platforms considered. All results are averaged over 1000 consecutive training steps, with the standard deviation shown as black bars.
  • ...and 4 more figures
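
The folding mentioned in the Figure 2 caption can be illustrated with the standard batchnorm-folding identity. The following is a minimal floating-point sketch for a linear layer, not the paper's implementation: the function names (`fold_batchnorm`, `linear`, `batchnorm`, `relu`) and the example values are hypothetical, fixed batchnorm statistics are assumed, and the quantization of the fused layer is not shown.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-output-channel batchnorm parameters into weights and bias.

    weights: list of rows, one row of input weights per output channel.
    bias, gamma, beta, mean, var: one float per output channel.
    Assumes fixed (already estimated) batchnorm statistics.
    """
    folded_w, folded_b = [], []
    for row, b, g, bt, mu, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)          # per-channel batchnorm scale
        folded_w.append([w * scale for w in row])
        folded_b.append((b - mu) * scale + bt)  # shift absorbed into the bias
    return folded_w, folded_b


def linear(x, weights, bias):
    """Plain dense layer: one dot product per output channel."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Per-channel batchnorm with fixed statistics."""
    return [g * (yi - mu) / math.sqrt(v + eps) + bt
            for yi, g, bt, mu, v in zip(y, gamma, beta, mean, var)]


def relu(y):
    return [max(0.0, yi) for yi in y]


# Hypothetical example values, chosen only for illustration.
x = [1.0, -2.0, 0.5]
W = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
b = [0.05, -0.02]
gamma, beta = [1.5, 0.8], [0.1, -0.2]
mean, var = [0.3, -0.1], [0.25, 4.0]

# Three separate layers versus one fused layer followed by ReLU.
separate = relu(batchnorm(linear(x, W, b), gamma, beta, mean, var))
fused_w, fused_b = fold_batchnorm(W, b, gamma, beta, mean, var)
fused = relu(linear(x, fused_w, fused_b))
```

Both paths compute the same outputs up to floating-point rounding, which is why the three layers can be replaced by a single one. In a fused layer, the ReLU would typically be realized by clamping the output, but how the paper handles this is not stated in the caption.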