TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge
Young D. Kwon, Rui Li, Stylianos I. Venieris, Jagmohan Chauhan, Nicholas D. Lane, Cecilia Mascolo
TL;DR
TinyTrain addresses on-device training under data scarcity and tight resource limits with a two-stage approach: offline few-shot learning–based pre-training establishes a robust global representation, and online task-adaptive sparse updates are then guided by a Fisher-information–based multi-objective criterion. The criterion dynamically selects which layers and channels to update so that training stays within the device's memory and compute budgets. Across MCUNet, MobileNetV2, and ProxylessNASNet on nine cross-domain datasets, TinyTrain achieves 3.6–5.0 percentage points higher accuracy than full-network fine-tuning while reducing backward-pass memory by up to 1,098× and backward-pass compute (MACs) by up to 7.68×, with end-to-end edge training completed in about 10 minutes on MCU-grade devices. The combination of FSL pre-training and per-task sparse adaptation makes on-device training feasible for real-world edge applications, enabling privacy-preserving personalisation without prohibitive energy or memory costs.
Abstract
On-device training is essential for user personalisation and privacy. With the pervasiveness of IoT devices and microcontroller units (MCUs), this task becomes more challenging due to constrained memory and compute resources and the limited availability of labelled user data. Nonetheless, prior works neglect the data scarcity issue, require excessively long training time (e.g. a few hours), or induce substantial accuracy loss (>10%). In this paper, we propose TinyTrain, an on-device training approach that drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity. TinyTrain introduces a task-adaptive sparse-update method that dynamically selects the layers and channels to update based on a multi-objective criterion that jointly captures the user data and the memory and compute capabilities of the target device, leading to high accuracy on unseen tasks with reduced computation and memory footprint. TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0% in accuracy, while reducing the backward-pass memory and computation cost by up to 1,098x and 7.68x, respectively. Targeting broadly used real-world edge devices, TinyTrain achieves 9.5x faster and 3.5x more energy-efficient training over status-quo approaches, and a 2.23x smaller memory footprint than SOTA methods, while remaining within the 1 MB memory envelope of MCU-grade platforms.
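To make the idea of a multi-objective selection criterion concrete, the following is a minimal sketch of budgeted layer selection. It is an illustrative greedy heuristic under stated assumptions, not the exact criterion used by TinyTrain: the `Layer` class, the `select_layers` function, the scoring formula (importance divided by normalised memory and compute cost), and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical per-layer statistics (all values are made up)."""
    name: str
    fisher: float    # importance estimate derived from the user's data
    mem_bytes: int   # extra backward-pass memory if this layer is updated (> 0)
    macs: int        # extra backward-pass MACs if this layer is updated (> 0)


def select_layers(layers, mem_budget, mac_budget):
    """Greedily pick layers with the best importance-per-cost score.

    Assumed score: fisher / (normalised memory * normalised MACs).
    Layers are visited from highest to lowest score, and a layer is
    kept only if it still fits in both remaining budgets.
    """
    max_mem = max(l.mem_bytes for l in layers)
    max_macs = max(l.macs for l in layers)

    def score(l):
        return l.fisher / ((l.mem_bytes / max_mem) * (l.macs / max_macs))

    chosen, mem_used, macs_used = [], 0, 0
    for l in sorted(layers, key=score, reverse=True):
        fits_mem = mem_used + l.mem_bytes <= mem_budget
        fits_macs = macs_used + l.macs <= mac_budget
        if fits_mem and fits_macs:
            chosen.append(l.name)
            mem_used += l.mem_bytes
            macs_used += l.macs
    return chosen, mem_used, macs_used


if __name__ == "__main__":
    layers = [
        Layer("conv1", 0.9, 400_000, 8_000_000),
        Layer("conv2", 0.5, 100_000, 2_000_000),
        Layer("conv3", 0.3, 50_000, 1_000_000),
        Layer("fc", 0.2, 20_000, 100_000),
    ]
    print(select_layers(layers, mem_budget=200_000, mac_budget=4_000_000))
```

In this toy setting the most important layer (`conv1`) is skipped because updating it alone would exceed the memory budget, while three cheaper layers fit together. This reflects the trade-off the abstract describes: the choice depends jointly on the data-driven importance and on the device's memory and compute limits.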
