TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge

Young D. Kwon, Rui Li, Stylianos I. Venieris, Jagmohan Chauhan, Nicholas D. Lane, Cecilia Mascolo

TL;DR

TinyTrain tackles on-device training under data scarcity and tight resource limits with a two-stage approach: offline pre-training based on few-shot learning (FSL) to establish a robust global representation, followed by online task-adaptive sparse updates guided by a Fisher-information-based multi-objective criterion. The criterion dynamically selects which layers and channels to update within the device's memory and compute budgets, achieving higher accuracy than full-network fine-tuning while sharply reducing backward-pass memory and MACs. Across MCUNet, MobileNetV2, and ProxylessNASNet on nine cross-domain datasets, TinyTrain delivers 3.6–5.0 percentage points higher accuracy than fine-tuning the entire network, with up to 1,098× lower backward-pass memory and up to 7.68× lower compute, and completes end-to-end edge training in about 10 minutes on MCU-grade devices. Combining FSL pre-training with per-task sparse adaptation makes on-device training feasible for real-world edge applications, enabling privacy-preserving personalization without prohibitive energy or memory costs.

Abstract

On-device training is essential for user personalisation and privacy. With the pervasiveness of IoT devices and microcontroller units (MCUs), this task becomes more challenging due to the constrained memory and compute resources, and the limited availability of labelled user data. Nonetheless, prior works neglect the data scarcity issue, require excessively long training time (e.g. a few hours), or induce substantial accuracy loss (>10%). In this paper, we propose TinyTrain, an on-device training approach that drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity. TinyTrain introduces a task-adaptive sparse-update method that dynamically selects the layer/channel to update based on a multi-objective criterion that jointly captures user data, the memory, and the compute capabilities of the target device, leading to high accuracy on unseen tasks with reduced computation and memory footprint. TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0% in accuracy, while reducing the backward-pass memory and computation cost by up to 1,098x and 7.68x, respectively. Targeting broadly used real-world edge devices, TinyTrain achieves 9.5x faster and 3.5x more energy-efficient training over status-quo approaches, and 2.23x smaller memory footprint than SOTA methods, while remaining within the 1 MB memory envelope of MCU-grade platforms.
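
To make the multi-objective criterion more concrete, here is a minimal sketch of how layers might be ranked and selected under memory and compute budgets. It is an illustration under stated assumptions, not the authors' implementation: the score form (Fisher potential divided by normalised parameter count and normalised MACs), the greedy selection, the use of parameter count as a stand-in for backward-pass memory, and all names (`multi_objective_scores`, `select_layers`, the toy numbers) are hypothetical.

```python
import numpy as np

def multi_objective_scores(fisher, n_params, macs):
    """Score each layer as importance per unit of (normalised) cost.

    Assumed form: Fisher potential / (relative memory cost * relative MAC cost).
    """
    fisher = np.asarray(fisher, dtype=float)
    mem_cost = np.asarray(n_params, dtype=float) / np.max(n_params)
    mac_cost = np.asarray(macs, dtype=float) / np.max(macs)
    return fisher / (mem_cost * mac_cost)

def select_layers(scores, n_params, macs, mem_budget, mac_budget):
    """Greedily pick layers in descending score order while both budgets hold."""
    chosen, used_mem, used_mac = [], 0.0, 0.0
    for i in np.argsort(-scores):
        fits_mem = used_mem + n_params[i] <= mem_budget
        fits_mac = used_mac + macs[i] <= mac_budget
        if fits_mem and fits_mac:
            chosen.append(int(i))
            used_mem += n_params[i]
            used_mac += macs[i]
    return sorted(chosen)

# Toy example with four layers (all values are made up for illustration).
fisher = [0.9, 0.2, 0.6, 0.4]        # per-layer importance on the user's data
n_params = [4000, 1000, 2000, 1000]  # proxy for backward-pass memory
macs = [8e5, 1e5, 2e5, 1e5]          # backward-pass compute

scores = multi_objective_scores(fisher, n_params, macs)
selected = select_layers(scores, n_params, macs, mem_budget=3000, mac_budget=3e5)
print(selected)
```

In this toy setting the layer with the highest raw importance (layer 0) is not selected, because its cost exceeds what the budgets allow; the cheaper layers 1 and 3 win on importance per unit cost.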

Paper Structure (35 sections, 5 equations, 16 figures, 11 tables, 1 algorithm)

Figures (16)

  • Figure 1: Cross-domain accuracy (y-axis) and compute cost in MAC count (x-axis) of TinyTrain and existing methods, targeting ProxylessNASNet on Meta-Dataset. The radius of the circles and the corresponding text denote the increase in the memory footprint of each baseline over TinyTrain. The dotted line represents the accuracy without on-device training.
  • Figure 2: Overview of TinyTrain. It consists of (1) the offline pre-training and (2) the online adaptive learning stages. In (1), TinyTrain pre-trains and meta-trains DNNs to improve the attainable accuracy when only a few samples are available for adaptation. Then, in (2), TinyTrain performs a task-adaptive sparse update based on the multi-objective criterion and dynamic layer/channel selection that co-optimises both memory and computation.
  • Figure 3: Memory- and compute-aware analysis of MCUNet by updating four different channel ratios on each layer. (a) Accuracy gain per layer is generally highest on the first layer of each block. (b) Accuracy gain per parameter of each layer is higher on the second layer of each block. (c) Accuracy gain per MAC of each layer peaks on the second layer of each block. These observations show that accuracy, memory footprint, and compute are in a trade-off relation.
  • Figure 4: The pairwise comparison between our dynamic channel selection and static channel selections (i.e. Random and L2-Norm) on MCUNet. The dynamic channel selection consistently outperforms static channel selections as the accuracy gain per layer differs by up to 8%. Also, the gap between dynamic and static channel selections increases as fewer channels are selected for updates.
  • Figure 5: End-to-end latency and energy consumption of the on-device training methods on three architectures.
  • ...and 11 more figures
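
The contrast in Figure 4 between dynamic and static channel selection can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes a per-channel Fisher potential of the form (1/2N) Σ_n (Σ_d a_nd · g_nd)², computed from activations and their gradients on the target task's samples, and compares it with a static L2-norm ranking of the weights. The function names and the toy tensors are hypothetical.

```python
import numpy as np

def fisher_channel_scores(acts, grads):
    """Data-dependent (dynamic) score per channel.

    acts, grads: arrays of shape (N, C, H, W) holding activations and the
    gradients of the loss with respect to those activations.
    """
    per_sample = (acts * grads).sum(axis=(2, 3))             # shape (N, C)
    return (per_sample ** 2).sum(axis=0) / (2 * acts.shape[0])

def l2_channel_scores(weight):
    """Static score per output channel: L2 norm of a (C_out, C_in, kH, kW) weight."""
    return np.sqrt((weight ** 2).sum(axis=(1, 2, 3)))

def top_k_channels(scores, ratio):
    """Indices of the top `ratio` fraction of channels, in ascending order."""
    k = max(1, int(round(ratio * len(scores))))
    return np.sort(np.argsort(-scores)[:k])

# Toy layer with 8 channels; only channels 1 and 5 receive gradient signal
# from the (made-up) target task.
acts = np.ones((4, 8, 2, 2))
grads = np.zeros((4, 8, 2, 2))
grads[:, 1] = 0.5
grads[:, 5] = 1.0

# Toy weights whose norm simply grows with the channel index.
weight = np.stack([np.full((3, 3, 3), 0.1 * (c + 1)) for c in range(8)])

dynamic = top_k_channels(fisher_channel_scores(acts, grads), ratio=0.25)
static = top_k_channels(l2_channel_scores(weight), ratio=0.25)
print(dynamic, static)
```

Here the dynamic ranking picks the channels that actually matter for the task data, while the L2-norm ranking picks the largest-magnitude channels regardless of the task, which is one plausible reading of why the gap widens as fewer channels are updated.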