Efficient Deployment of Large Language Models on Resource-constrained Devices

Zhiwei Yao; Yang Xu; Hongli Xu; Yunming Liao; Zuan Xie

TL;DR

FedSpine targets the deployment of large language models on resource-constrained devices under privacy constraints and heterogeneous data distributions. It combines parameter-efficient fine-tuning via LoRA with structured pruning inside a federated learning framework, and uses an online multi-armed bandit algorithm (S-UCB) to adapt the pruning ratio and LoRA rank of each device, which mitigates stragglers and improves efficiency. The approach includes loss-based, group-aware pruning guided by LoRA gradients, adaptive LoRA rank allocation based on per-device importance (including singular values), and a heterogeneity-aware aggregation scheme. On an 80-device testbed it achieves 1.4×–6.9× fine-tuning speedups and 0.4%–4.5% accuracy gains over baselines at the same sparsity level. These results indicate that scalable, privacy-preserving on-device LLM deployment is practical across diverse hardware and data distributions.
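To make the interaction between structured pruning and LoRA concrete, the following is a minimal sketch, not FedSpine's implementation. It assumes a single linear layer with a frozen weight `W` and a LoRA factor `B`, and it uses the L2 norm of each output row as a placeholder importance score; the summary above describes a loss-based criterion guided by LoRA gradients, which is not reproduced here. The names `lora_param_count` and `prune_rows` are hypothetical.

```python
import math
import random

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters of one LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)

def prune_rows(W, B, ratio):
    """Structured pruning stand-in: drop whole output rows of W and B.

    W is a list of d_out rows of length d_in; B is a list of d_out rows
    of length r. Row L2 norm is an assumed importance score, used only
    for illustration.
    """
    n_prune = int(round(ratio * len(W)))
    scores = [math.sqrt(sum(x * x for x in row)) for row in W]
    order = sorted(range(len(W)), key=scores.__getitem__)  # least important first
    keep = sorted(order[n_prune:])                          # retained row indices
    return [W[i] for i in keep], [B[i] for i in keep], keep

rng = random.Random(0)
d_out, d_in, rank = 8, 16, 2
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
B = [[0.0] * rank for _ in range(d_out)]  # LoRA convention: B starts at zero

W_p, B_p, keep = prune_rows(W, B, ratio=0.5)
before = lora_param_count(d_out, d_in, rank)
after = lora_param_count(len(W_p), d_in, rank)
print(f"rows kept: {keep}; LoRA parameters: {before} -> {after}")
```

Removing an output row shrinks both the frozen weight and the matching row of `B`, so pruning reduces the inference footprint and the number of adapter parameters exchanged during training. The factor `A` is untouched in this sketch because only output rows are removed.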

Abstract

Deploying Large Language Models (LLMs) on resource-constrained (or weak) devices presents significant challenges due to limited resources and heterogeneous data distribution. To address the data concern, it is necessary to fine-tune LLMs using on-device private data for various downstream tasks. While Federated Learning (FL) offers a promising privacy-preserving solution, existing fine-tuning methods retain the original LLM size, leaving the issues of high inference latency and excessive memory demands unresolved. Hence, we design FedSpine, an FL framework that combines Parameter-Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices. Specifically, FedSpine introduces an iterative process to prune and tune the parameters of LLMs. To mitigate the impact of device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm is employed to adaptively determine different pruning ratios and LoRA ranks for heterogeneous devices without any prior knowledge of their computing and communication capabilities. As a result, FedSpine maintains higher inference accuracy while improving fine-tuning efficiency. Experiments conducted on a physical platform with 80 devices demonstrate that FedSpine can speed up fine-tuning by 1.4$\times$-6.9$\times$ and improve final accuracy by 0.4%-4.5% under the same sparsity level compared to other baselines.
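The idea of choosing a pruning ratio and LoRA rank online, without prior knowledge of device capabilities, can be illustrated with a standard UCB1 bandit. This is a generic sketch and not the paper's algorithm: the arm grid, the class name `UCB1`, and the function `simulated_reward` are all assumptions made for illustration, and the reward is synthetic rather than measured on a device.

```python
import math
import random

class UCB1:
    """Standard UCB1 over a finite set of arms (here: (pruning ratio, LoRA rank) pairs)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = [0] * len(self.arms)
        self.means = [0.0] * len(self.arms)
        self.t = 0  # total number of completed pulls

    def select(self):
        # Play every arm once before relying on confidence bounds.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        scores = [m + math.sqrt(2.0 * math.log(self.t) / c)
                  for m, c in zip(self.means, self.counts)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, reward):
        self.t += 1
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]  # running mean

def simulated_reward(p, r, rng):
    """Made-up reward in [0, 1]; a real system would measure it per round."""
    base = 1.0 - abs(p - 0.4) - abs(r - 8) / 32.0
    return min(1.0, max(0.0, base + rng.gauss(0.0, 0.05)))

rng = random.Random(0)
arms = [(p, r) for p in (0.2, 0.4, 0.6) for r in (4, 8, 16)]
bandit = UCB1(arms)

rounds = 200
for _ in range(rounds):
    i = bandit.select()
    p, r = bandit.arms[i]
    bandit.update(i, simulated_reward(p, r, rng))

best = max(range(len(arms)), key=bandit.counts.__getitem__)
print("most-played configuration (pruning ratio, LoRA rank):", bandit.arms[best])
```

In a federated setting, one such bandit per device would let slow devices drift toward configurations that finish a round sooner, which is the straggler-mitigation effect the abstract describes. How the reward balances accuracy against completion time is a design choice left open in this sketch.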

Paper Structure (20 sections, 17 equations, 6 figures, 7 tables, 1 algorithm)

This paper contains 20 sections, 17 equations, 6 figures, 7 tables, 1 algorithm.

Introduction
Preliminaries and Motivations
Integrating LoRA into FL: Benefits and Potential Challenges
Combination of Pruning and Fine-tuning
Impact of Device Heterogeneity
Motivations for Adaptive Pruning and Fine-tuning
System Overview
FedSpine Design
Algorithm for Configuration Update
Problem Definition
Multi-Armed Bandit Based Algorithm
Model Pruning
Model Fine-tuning
LoRA Aggregation
Implementation and Evaluation
...and 5 more sections

Figures (6)

Figure 1: Peak memory footprint and test accuracy of different fine-tuning methods and inference.
Figure 2: The results of preliminary experiments ($p$ and $r$ separately denote pruning ratio and LoRA rank). (a) The fine-tuning time and exchanged parameters of LoRA and Full-FT for RoBERTa on SST-2; (b) Test accuracy of three methods after reaching the target pruning ratio on SST-2; (c) Ranked completion time consumed by heterogeneous devices with identical pruning ratios and LoRA ranks in one round; (d) The fine-tuning process of FedAPT with different pruning ratios and LoRA ranks.
Figure 3: Overview of FedSpine's workflow.
Figure 4: Performance comparison of RoBERTa on SST-2 and MNLI.
Figure 5: Completion time under different heterogeneous levels.
...and 1 more figure
