Resource-Efficient Personal Large Language Models Fine-Tuning with Collaborative Edge Computing

Shengyuan Ye, Bei Ouyang, Tianyi Qian, Liekang Zeng, Jingyi Li, Jiangsu Du, Xiaowen Chu, Guoliang Xing, Xu Chen

TL;DR

PAC+ addresses the resource bottleneck of personal LLM fine-tuning on edge devices by combining a lightweight Parallel Adapters scheme with an activation cache, quantization, and a heterogeneity-aware collaborative planner. The algorithm-system co-design enables in-situ fine-tuning across proximate edge devices via a hybrid data/pipeline parallelism strategy, while caching backbone activations eliminates repeated forward passes through the large backbone. Empirical evaluation across three LLMs and GLUE tasks shows up to 8.64× end-to-end speedups and up to 88.16% memory reductions compared to baselines, with minimal loss in accuracy. The work demonstrates practical edge deployments for privacy-preserving personal LLMs, offering significant improvements in training efficiency and scalability in heterogeneous edge environments.

Abstract

Large language models (LLMs) have unlocked a plethora of powerful applications at the network edge, such as intelligent personal assistants. Data privacy and security concerns have prompted a shift towards edge-based fine-tuning of personal LLMs, away from cloud reliance. However, this raises issues of computational intensity and resource scarcity, hindering training efficiency and feasibility. While current studies investigate parameter-efficient fine-tuning (PEFT) techniques to mitigate resource constraints, our analysis indicates that these techniques are not sufficiently resource-efficient for edge devices. To tackle these challenges, we propose Pluto and Charon (PAC), a time- and memory-efficient collaborative edge AI framework for personal LLM fine-tuning. PAC breaks the resource wall of personal LLM fine-tuning with a sophisticated algorithm-system co-design. (1) Algorithmically, PAC implements a personal LLM fine-tuning technique that is efficient in terms of parameters, time, and memory. It utilizes Parallel Adapters to circumvent the need for a full backward pass through the LLM backbone. Additionally, an activation cache mechanism further streamlines the process by removing the need for repeated forward passes across multiple epochs. (2) Systematically, PAC leverages edge devices in close proximity, pooling them as a collective resource for in-situ personal LLM fine-tuning, and uses hybrid data and pipeline parallelism to orchestrate distributed training. Once the activation cache is in use, forward passes through the LLM backbone are no longer needed, so the Parallel Adapters alone can be fine-tuned using data parallelism. Extensive evaluation based on a prototype implementation demonstrates that PAC remarkably outperforms state-of-the-art approaches, achieving up to 8.64× end-to-end speedup and up to 88.16% reduction in memory footprint.
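
The interplay between the frozen backbone, the Parallel Adapters, and the activation cache can be illustrated with a toy sketch. It is not the authors' implementation: the names `FrozenBackbone`, `ParallelAdapter`, and `fine_tune` are hypothetical, the "backbone" is a fixed elementwise scaling rather than a transformer, and the adapter is a single linear layer. The sketch only shows the two properties the abstract describes: gradients update the adapter alone, and each sample passes through the backbone once, regardless of the number of epochs.

```python
class FrozenBackbone:
    """Hypothetical stand-in for a frozen LLM backbone; counts forward passes."""

    def __init__(self, weights):
        self.weights = list(weights)  # never updated during fine-tuning
        self.forward_calls = 0

    def forward(self, x):
        self.forward_calls += 1
        # Toy "hidden activation": elementwise scaling of the input.
        return [w * xi for w, xi in zip(self.weights, x)]


class ParallelAdapter:
    """Hypothetical small trainable side network fed by backbone activations."""

    def __init__(self, dim):
        self.w = [0.0] * dim

    def predict(self, h):
        return sum(wi * hi for wi, hi in zip(self.w, h))

    def sgd_step(self, h, target, lr):
        err = self.predict(h) - target
        # Gradient of 0.5 * err^2 w.r.t. the adapter weights is err * h.
        # Nothing is back-propagated into the backbone.
        self.w = [wi - lr * err * hi for wi, hi in zip(self.w, h)]
        return 0.5 * err * err


def fine_tune(backbone, adapter, dataset, epochs, lr=0.1):
    """Train the adapter only, reusing cached backbone activations."""
    cache = {}  # sample index -> backbone activation
    losses = []
    for _ in range(epochs):
        total = 0.0
        for idx, (x, y) in enumerate(dataset):
            if idx not in cache:  # backbone runs only on a cache miss
                cache[idx] = backbone.forward(x)
            total += adapter.sgd_step(cache[idx], y, lr)
        losses.append(total / len(dataset))
    return losses


if __name__ == "__main__":
    backbone = FrozenBackbone([0.5, -0.5])
    adapter = ParallelAdapter(dim=2)
    data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
    fine_tune(backbone, adapter, data, epochs=5, lr=0.5)
    print(backbone.forward_calls)  # 2 backbone passes, not 2 * 5
```

Because the cached activations are independent of the adapter weights, they could in principle be sharded across devices so that each device trains an adapter replica on its own shard, which is the data-parallel phase the abstract refers to.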

Paper Structure

This paper contains 25 sections, 7 equations, 18 figures, 7 tables, 1 algorithm.

Figures (18)

  • Figure 1: An illustration of hosting personal LLM-based intelligent agents within a smart home.
  • Figure 2: Illustration of the model structures with two PEFT techniques.
  • Figure 3: Comparison of floating-point operations (FLOPs). Mini-batch size: 16; sequence length: 128.
  • Figure 4: PAC+ workflow.
  • Figure 5: Comparison between LLMs fine-tuning with LoRA, Adapters, and our Parallel Adapters.
  • ...and 13 more figures