Dynamic Gradient Sparse Update for Edge Training
I-Hsuan Li, Tian-Sheuan Chang
TL;DR
This work tackles the memory bottleneck of on-device transfer learning by combining offline pruning of a pre-trained model with a dynamic gradient sparse update strategy that selectively updates channels and layers within a tight memory budget. It uses DepGraph-based channel pruning and pattern-based activation pruning, together with a dynamic, stage-aware update schedule that traverses most parameters over time while freezing the front layers to protect general features. In experiments transferring MobileNetV2 from ImageNet to CIFAR-10, the approach achieves 85.77% accuracy while updating only 2% of convolution weights and saving about 98% of feature memory compared with dense training, all within a 256KB on-chip memory budget. The results indicate that edge training is practically feasible under this budget and offer a framework for memory-efficient transfer learning on resource-constrained devices.
Abstract
Training on edge devices enables personalized model fine-tuning to enhance real-world performance and maintain data privacy. However, gradient computation for backpropagation requires significant memory buffers to store intermediate features and compute losses, which is prohibitive for memory-constrained edge devices such as microcontrollers. To tackle this issue, we propose a training acceleration method using dynamic gradient sparse updates. This method updates only the important channels and layers and skips gradient computation for the less important ones, reducing memory usage in each update iteration. In addition, the channel selection changes across iterations so that most of the parameters in the update layers are traversed over time, which improves performance. Experimental results show that the proposed method enables an ImageNet pre-trained MobileNetV2 trained on CIFAR-10 to achieve an accuracy of 85.77% while updating only 2% of convolution weights within 256KB of on-chip memory. This yields a 98% reduction in feature memory usage compared to dense model training.
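To make the idea of a per-iteration channel budget concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple rotating-window rule for choosing which output channels to update at each step; the selection criterion actually used in the paper (importance-based and stage-aware) is not specified in this section. The names `select_channels`, `sparse_sgd_step`, and `coverage` are hypothetical.

```python
def select_channels(num_channels, budget, step):
    """Return the channel indices to update at this training step.

    Assumed rule (illustrative only): a window of `budget` channels that
    advances by `budget` positions every step and wraps around.
    """
    if not 0 < budget <= num_channels:
        raise ValueError("budget must be in 1..num_channels")
    start = (step * budget) % num_channels
    return [(start + i) % num_channels for i in range(budget)]


def sparse_sgd_step(weights, grads, lr):
    """Apply plain SGD to the selected channels only.

    `weights` is a list of per-channel weight rows. `grads` maps a channel
    index to its gradient row and holds entries for the selected channels
    only, mirroring the fact that gradients of skipped channels are never
    computed or stored.
    """
    for channel, grad_row in grads.items():
        weights[channel] = [w - lr * g for w, g in zip(weights[channel], grad_row)]
    return weights


def coverage(num_channels, budget, num_steps):
    """Fraction of channels updated at least once within `num_steps` steps."""
    seen = set()
    for step in range(num_steps):
        seen.update(select_channels(num_channels, budget, step))
    return len(seen) / num_channels
```

Under this assumed rule, a layer with 8 channels and a budget of 2 updates a quarter of its channels per step and touches every channel once after 4 steps, which illustrates how a small per-iteration budget can still traverse most parameters over time.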
