LoRA-Edge: Tensor-Train-Assisted LoRA for Practical CNN Fine-Tuning on Edge Devices
Hyunseok Kwak, Kyeongwon Lee, Jae-Jin Lee, Woojoo Lee
TL;DR
LoRA-Edge tackles on-device CNN fine-tuning under tight resource budgets by combining tensor-train (TT) decomposition with low-rank adaptation. It preserves the convolutional structure by applying TT-SVD to pre-trained weights, trains only the output-side TT core with zero initialization, and merges the update back into the dense kernels so that inference cost stays identical to the backbone. Across HAR datasets and CNN backbones, the approach reaches accuracy close to full fine-tuning (within 4.7%) while updating at most $1.49\%$ of the parameters, and it converges up to $3.8\times$ faster on edge hardware. This work shows that structure-aware, merge-after-training PEFT can make frequent on-device CNN adaptation practical for real-world edge applications.
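The parameter saving from training only the output-side core can be estimated from the TT core shapes alone. The sketch below is illustrative and assumes a Conv1d kernel (common in HAR models) reordered as (C_in, K, C_out) so that the last core carries the output channels, a single rank cap `max_rank`, and generic (full-rank) unfoldings. The helper names `tt_core_shapes` and `trainable_fraction` are hypothetical, not from the paper. It counts one layer's kernel only, ignoring biases and normalization layers, so it does not reproduce the model-level 1.49% figure.

```python
from math import prod


def tt_core_shapes(dims, max_rank):
    """Core shapes (r_{k-1}, n_k, r_k) from a rank-capped TT-SVD.

    Assumes every unfolding is full rank before truncation, so each TT rank
    is limited only by the cap and by the two sides of its unfolding.
    """
    shapes = []
    r_prev = 1
    for k, n in enumerate(dims[:-1]):
        # unfolding at step k has shape (r_prev * n, prod of remaining dims)
        r = min(max_rank, r_prev * n, prod(dims[k + 1:]))
        shapes.append((r_prev, n, r))
        r_prev = r
    # last (output-side) core closes the train with a trailing rank of 1
    shapes.append((r_prev, dims[-1], 1))
    return shapes


def trainable_fraction(dims, max_rank):
    """Share of the dense kernel's size that is trainable when only the
    output-side core is updated and all other cores stay frozen."""
    shapes = tt_core_shapes(dims, max_rank)
    return prod(shapes[-1]) / prod(dims)


# Hypothetical Conv1d layer: 64 -> 64 channels, kernel size 5, as (C_in, K, C_out)
dims = (64, 5, 64)
shapes = tt_core_shapes(dims, max_rank=8)    # [(1, 64, 8), (8, 5, 8), (8, 64, 1)]
frac = trainable_fraction(dims, max_rank=8)  # 512 / 20480 = 0.025
```

Here the output-side core holds 512 of the 20,480 kernel weights, so this layer trains 2.5% of its kernel; smaller rank caps shrink that share further.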
Abstract
On-device fine-tuning of CNNs is essential to withstand domain shift in edge applications such as Human Activity Recognition (HAR), yet full fine-tuning is infeasible under strict memory, compute, and energy budgets. We present LoRA-Edge, a parameter-efficient fine-tuning (PEFT) method that extends Low-Rank Adaptation (LoRA) with tensor-train decomposition. LoRA-Edge (i) applies Tensor-Train Singular Value Decomposition (TT-SVD) to pre-trained convolutional layers, (ii) updates only the output-side core, zero-initialized so that the auxiliary path is inactive at the start of training, and (iii) fuses the update back into dense kernels, leaving inference cost unchanged. This design preserves convolutional structure and reduces the number of trainable parameters by up to two orders of magnitude compared to full fine-tuning. Across diverse HAR datasets and CNN backbones, LoRA-Edge achieves accuracy within 4.7% of full fine-tuning while updating at most 1.49% of parameters, consistently outperforming prior parameter-efficient baselines under similar budgets. On a Jetson Orin Nano, TT-SVD initialization and selective-core training yield 1.4x to 3.8x faster convergence to a target F1 score. LoRA-Edge thus makes structure-aligned, parameter-efficient on-device CNN adaptation practical for edge platforms.
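Steps (i) to (iii) can be sketched for a single Conv1d layer in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the update is taken to be an additive auxiliary path ΔW obtained by contracting the frozen TT-SVD cores with a trainable output-side core; the kernel is reordered as (C_in, K, C_out) before decomposition; no LoRA-style scaling factor is applied; and random values stand in for a trained output core. `tt_svd`, `tt_to_full`, `conv1d`, and `delta_kernel` are hypothetical helpers.

```python
import numpy as np


def tt_svd(tensor, max_rank):
    """Rank-capped TT-SVD via sequential truncated SVDs."""
    dims = tensor.shape
    cores, r_prev, c = [], 1, tensor
    for n in dims[:-1]:
        c = c.reshape(r_prev * n, -1)
        u, s, vt = np.linalg.svd(c, full_matrices=False)
        r = min(max_rank, s.size)
        cores.append(u[:, :r].reshape(r_prev, n, r))
        c = s[:r, None] * vt[:r]  # carry the remainder to the next core
        r_prev = r
    cores.append(c.reshape(r_prev, dims[-1], 1))
    return cores


def tt_to_full(cores):
    """Contract TT cores back into a dense tensor."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.reshape(full.shape[1:-1])  # drop the boundary ranks of 1


def conv1d(x, w):
    """Valid 1D cross-correlation: x is (C_in, T), w is (C_out, C_in, K)."""
    k = w.shape[2]
    t_out = x.shape[1] - k + 1
    # windows[i, t, j] = x[i, t + j]
    windows = np.stack([x[:, j:j + t_out] for j in range(k)], axis=-1)
    return np.einsum("oik,itk->ot", w, windows)


rng = np.random.default_rng(0)
C_OUT, C_IN, K = 16, 8, 5
w = rng.standard_normal((C_OUT, C_IN, K))  # stand-in for a frozen pre-trained kernel
x = rng.standard_normal((C_IN, 32))        # one input window

# (i) TT-SVD of the kernel reordered to (C_in, K, C_out): last core is output-side
g_in, g_k, g_out = tt_svd(w.transpose(1, 2, 0), max_rank=4)

# (ii) g_in and g_k stay frozen; the trainable output core starts at zero
g_out_train = np.zeros_like(g_out)


def delta_kernel(core_out):
    """Auxiliary update dW in the original (C_out, C_in, K) layout."""
    return tt_to_full([g_in, g_k, core_out]).transpose(2, 0, 1)


y_base = conv1d(x, w)
y_init = conv1d(x, w) + conv1d(x, delta_kernel(g_out_train))
assert np.allclose(y_init, y_base)  # zero init: auxiliary path is inactive

# Placeholder for gradient updates to the output core
g_out_train = 0.01 * rng.standard_normal(g_out.shape)

# (iii) merge into one dense kernel with the backbone's shape
w_merged = w + delta_kernel(g_out_train)
y_two_path = conv1d(x, w) + conv1d(x, delta_kernel(g_out_train))
y_merged = conv1d(x, w_merged)
```

Because convolution is linear in the kernel, `y_merged` matches `y_two_path`, and `w_merged` has the same shape as `w`. This is why merging after training leaves inference cost identical to the backbone.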
