LoRA-Edge: Tensor-Train-Assisted LoRA for Practical CNN Fine-Tuning on Edge Devices

Hyunseok Kwak, Kyeongwon Lee, Jae-Jin Lee, Woojoo Lee

TL;DR

LoRA-Edge targets on-device CNN fine-tuning under tight resource budgets by combining tensor-train (TT) decomposition with low-rank adaptation. It preserves the convolutional structure by applying TT-SVD to pre-trained weights, trains only the zero-initialized output-side TT core, and merges the update back into the dense kernels so that inference cost stays identical to the backbone. Across HAR datasets and CNN backbones, it reaches accuracy within 4.7% of full fine-tuning while updating at most $\approx 1.49\%$ of the parameters, and it converges up to $3.8\times$ faster on edge hardware. The results suggest that structure-aware, merge-after-training PEFT can make frequent on-device CNN adaptation practical for real-world edge applications.

Abstract

On-device fine-tuning of CNNs is essential to withstand domain shift in edge applications such as Human Activity Recognition (HAR), yet full fine-tuning is infeasible under strict memory, compute, and energy budgets. We present LoRA-Edge, a parameter-efficient fine-tuning (PEFT) method that builds on Low-Rank Adaptation (LoRA) with tensor-train assistance. LoRA-Edge (i) applies Tensor-Train Singular Value Decomposition (TT-SVD) to pre-trained convolutional layers, (ii) selectively updates only the output-side core with zero-initialization to keep the auxiliary path inactive at the start, and (iii) fuses the update back into dense kernels, leaving inference cost unchanged. This design preserves convolutional structure and reduces the number of trainable parameters by up to two orders of magnitude compared to full fine-tuning. Across diverse HAR datasets and CNN backbones, LoRA-Edge achieves accuracy within 4.7% of full fine-tuning while updating at most 1.49% of parameters, consistently outperforming prior parameter-efficient baselines under similar budgets. On a Jetson Orin Nano, TT-SVD initialization and selective-core training yield 1.4-3.8x faster convergence to target F1. LoRA-Edge thus makes structure-aligned, parameter-efficient on-device CNN adaptation practical for edge platforms.
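The three steps named in the abstract (TT-SVD of a pre-trained kernel, a zero-initialized trainable output-side core, and a merge back into the dense kernel) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The mode ordering (output channels last), the rank cap `max_rank`, the additive form `w0 + delta`, and the helper names `tt_svd`, `tt_to_full`, and `delta_w` are all assumptions made for illustration, and the random "training step" is only a stand-in for a gradient update.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Textbook TT-SVD: sequential truncated SVDs.
    Returns cores of shape (r_prev, n_k, r_next)."""
    dims = tensor.shape
    cores, r_prev, mat = [], 1, tensor
    for k in range(len(dims) - 1):
        mat = mat.reshape(r_prev * dims[k], -1)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(u[:, :r].reshape(r_prev, dims[k], r))
        mat = s[:r, None] * vt[:r]          # carry the remainder to the next mode
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

def tt_to_full(cores):
    """Contract a list of TT cores back into a dense tensor."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.reshape(full.shape[1:-1])   # drop the boundary ranks of size 1

rng = np.random.default_rng(0)
c_out, c_in, k = 8, 4, 3
w0 = rng.standard_normal((c_out, c_in, k, k))    # stand-in for a pre-trained kernel

# Assumed ordering: move C_out last so the final TT core is the "output-side" core.
w_perm = np.transpose(w0, (1, 2, 3, 0))          # (c_in, k, k, c_out)
cores = tt_svd(w_perm, max_rank=4)

frozen = cores[:-1]                              # kept fixed during adaptation
delta_core = np.zeros_like(cores[-1])            # trainable, zero-initialized

def delta_w(frozen_cores, out_core):
    """Dense update implied by the frozen cores and the trainable output core."""
    d = tt_to_full(frozen_cores + [out_core])
    return np.transpose(d, (3, 0, 1, 2))         # back to (c_out, c_in, k, k)

# Zero init means the auxiliary path contributes nothing at the start.
assert np.allclose(delta_w(frozen, delta_core), 0.0)

# Placeholder for training: perturb only the output-side core.
delta_core = delta_core + 0.01 * rng.standard_normal(delta_core.shape)

# Merge: the result is a dense kernel of the original shape.
w_merged = w0 + delta_w(frozen, delta_core)
trainable, total = delta_core.size, w0.size
print(f"trainable params: {trainable} of {total}")
```

Because the merged kernel has the same shape as the original one, a convolution using `w_merged` costs the same at inference as one using `w0`. The trainable fraction in this toy example is much larger than the 1.49% reported in the abstract, since the layer here is tiny.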

Paper Structure

This paper contains 26 sections, 8 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: LoRA overview with trainable low-rank matrices $A,B$ and the merge into $W_0$.
  • Figure 2: Architectural view of LoRA-Edge.
  • Figure 3: Average F1 difference ($\Delta$F1, %) between the proposed TT-SVD init. and random init. at the same learning rate $\eta$, across $\sigma^2$ on Opportunity dataset.
  • Figure 4: F1 over 50 steps for TT-core training strategies on RealWorld.
  • Figure 5: Confusion matrices of LoRA-Edge on Opportunity.
  • ...and 1 more figure