TrainDeeploy: Hardware-Accelerated Parameter-Efficient Fine-Tuning of Small Transformer Models at the Extreme Edge

Run Wang, Victor J. B. Jung, Philip Wiese, Francesco Conti, Alessio Burrello, Luca Benini

TL;DR

TrainDeeploy provides the first complete on-device training pipeline for extreme-edge SoCs supporting both Convolutional Neural Networks and Transformer models, together with multiple training strategies such as selective layer-wise fine-tuning and Low-Rank Adaptation (LoRA).

Abstract

On-device tuning of deep neural networks enables long-term adaptation at the edge while preserving data privacy. However, the high computational and memory demands of backpropagation pose significant challenges for ultra-low-power, memory-constrained extreme-edge devices. These challenges are further amplified for attention-based models due to their architectural complexity and computational scale. We present TrainDeeploy, a framework that unifies efficient inference and on-device training on heterogeneous ultra-low-power System-on-Chips (SoCs). TrainDeeploy provides the first complete on-device training pipeline for extreme-edge SoCs supporting both Convolutional Neural Networks (CNNs) and Transformer models, together with multiple training strategies such as selective layer-wise fine-tuning and Low-Rank Adaptation (LoRA). On a RISC-V-based heterogeneous SoC, we demonstrate the first end-to-end on-device fine-tuning of a Compact Convolutional Transformer (CCT), achieving up to 11 trained images per second. We show that LoRA reduces dynamic memory usage by 23%, decreases the number of trainable parameters and gradients by 15x, and reduces memory transfer volume by 1.6x compared to full backpropagation. TrainDeeploy achieves up to 4.6 FLOP/cycle on CCT (0.28M parameters, 71-126M FLOPs) and up to 13.4 FLOP/cycle on Deep-AE (0.27M parameters, 0.8M FLOPs), while expanding the scope of prior frameworks to support both CNN and Transformer models with parameter-efficient tuning on extreme-edge platforms.

Paper Structure (26 sections, 6 figures, 2 tables)

This paper contains 26 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Low-Rank Adaptation (LoRA). (a) Frozen pre-trained weight $W_0$ with trainable low-rank matrices $A$ and $B$. (b) Comparison of memory footprint over time between full-parameter fine-tuning and LoRA. The stacked areas illustrate how tensors are allocated and released during execution. By reducing gradient storage, LoRA further lowers the peak memory footprint, enabling training within tight on-chip memory budgets.
  • Figure 2: Overview of the TrainDeeploy framework. (a) Models are defined in PyTorch and exported through ONNX. (b) The training engine augments the forward graph with the backward graph via autograd, producing a full training graph. (c) Deeploy extends its inference compiler to training with a frontend (ONNX parsing), a midend (memory optimizer integrating tiling and static allocation across the full forward-backward graph), and a backend (C code generation). (d) The generated code is deployed on the heterogeneous SoC, leveraging hierarchical memory mapping (L1-L3) and an on-board accelerator for GEMM.
  • Figure 3: Illustration of fine-tuning strategies evaluated on the CCT-2 model for on-device training. The convolutional tokenizer (Conv) is frozen in all strategies, while different subsets of transformer encoder blocks (Attn) and the classifier head are adapted. Five representative strategies are considered: LP (linear probing), only the classifier head is trained; FT-1, full fine-tuning of the last attention block; LoRA-1, low-rank adaptation (LoRA, rank $r=4$) applied to the last attention block; FT-2, full fine-tuning of the last two attention blocks; and LoRA-2, LoRA (rank $r=4$) applied to the last two attention blocks.
  • Figure 4: Hardware setup of the PULP-based SoC modeled in GVSoC. The system consists of a host and an 8-core compute cluster with shared multi-banked L1 TCDM (128 KB), a hierarchical memory with L2 SRAM (2 MB) and external L3 HyperRAM (32 MB), and the RedMulE floating-point GEMM accelerator integrated with direct low-latency access to L1.
  • Figure 5: End-to-end training latency across fine-tuning strategies. For each strategy, the left bar shows runtime using 8 cores without RedMulE acceleration, while the right bar shows runtime with RedMulE acceleration. In the accelerated LoRA-2 and FT-2 configurations, the measured latency corresponds to a peak throughput of up to 11 gradient updates per second under single-sample, end-to-end fine-tuning.
  • ...and 1 more figures
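
The mechanism in Figure 1(a) can be illustrated with a minimal pure-Python sketch of one LoRA-adapted linear layer processing a single sample. It is not TrainDeeploy's generated code. It assumes the common LoRA convention $h = W_0 x + B(Ax)$ with $B$ initialized to zero, and the layer sizes (`d_in`, `d_out`) are hypothetical; only the rank $r=4$ is taken from the Figure 3 caption. The point is that the backward pass produces gradients for $A$ and $B$ only, so no gradient buffer the size of $W_0$ is needed.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

def count(M):
    """Number of scalar entries in matrix M."""
    return sum(len(row) for row in M)

def lora_forward(W0, A, B, x):
    """Return h = W0 x + B (A x) and the cached low-rank activation z = A x."""
    z = matvec(A, x)
    h = [base + delta for base, delta in zip(matvec(W0, x), matvec(B, z))]
    return h, z

def lora_backward(B, x, z, grad_h):
    """Gradients for A and B only; W0 is frozen, so it gets no gradient."""
    grad_B = outer(grad_h, z)                    # shape (d_out, r)
    B_t = [list(col) for col in zip(*B)]         # B^T, shape (r, d_out)
    grad_A = outer(matvec(B_t, grad_h), x)       # shape (r, d_in)
    return grad_A, grad_B

# Hypothetical layer sizes (assumption); r = 4 matches the rank in Figure 3.
d_in, d_out, r = 128, 128, 4
rng = random.Random(0)
W0 = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]       # trainable
B = [[0.0] * r for _ in range(d_out)]  # trainable, zero init: adapter starts as a no-op
x = [rng.gauss(0, 1) for _ in range(d_in)]

h, z = lora_forward(W0, A, B, x)
grad_A, grad_B = lora_backward(B, x, z, grad_h=[1.0] * d_out)

full_params = count(W0)            # gradient entries under full fine-tuning
lora_params = count(A) + count(B)  # gradient entries under LoRA
print(full_params, lora_params, full_params / lora_params)
```

For this hypothetical 128x128 layer the trainable-parameter ratio is 16x. That number depends entirely on the chosen sizes and is not the 15x reduction reported in the abstract, which is measured over the actual CCT configuration.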