Split Fine-Tuning for Large Language Models in Wireless Networks

Songge Zhang, Guoliang Cheng, Xinyu Huang, Zuguang Li, Wen Wu, Lingyang Song, Xuemin Shen

TL;DR

This work tackles the challenge of fine-tuning large language models (LLMs) on resource-limited mobile devices over wireless networks. It introduces Split Fine-Tuning (SFT), which partitions the LLM between an edge server and the devices, enabling parallel device updates and server-side aggregation via LoRA adapters. A joint compression scheme (Top-K sparsification, stochastic quantization, and lossless encoding) sharply reduces the data exchanged between devices and the edge server, while a two-timescale optimization framework (augmented Lagrangian for the compression rate and transformer block allocation, sequential quadratic programming (SQP) for bandwidth) minimizes fine-tuning delay under accuracy and memory constraints. In simulations on CIFAR100 and Tiny-ImageNet, SFT reduces fine-tuning delay by up to 80.2% and communication overhead by 93.6% while satisfying the accuracy and device-side memory constraints, which suggests that collaborative LLM fine-tuning is practical in resource-constrained wireless networks.

Abstract

Fine-tuning is the process of adapting pre-trained large language models (LLMs) to downstream tasks. Because LLMs have a very large number of parameters, fine-tuning them on mobile devices demands considerable memory and suffers from high communication overhead and long fine-tuning delay. In this paper, we propose an efficient LLM fine-tuning scheme for wireless networks, named Split Fine-Tuning (SFT), which can accommodate LLM fine-tuning on mobile devices. Specifically, an LLM is split into a server-side part on the edge server and a device-side part on the mobile device to satisfy the device-side memory constraint. All devices share one server-side model and perform fine-tuning in parallel to reduce fine-tuning delay. In addition, to reduce the significant communication overhead incurred by data exchange between devices and the edge server, we propose a data compression scheme that jointly leverages sparsification, stochastic quantization, and lossless encoding. Furthermore, we formulate a fine-tuning delay minimization problem under accuracy and memory constraints, taking device heterogeneity and channel dynamics into account. To solve it, the nonlinear mixed-integer problem is decoupled into two subproblems on different timescales. A two-timescale resource management algorithm is proposed that jointly optimizes the compression rate and transformer block allocation on the large timescale using the augmented Lagrangian method, and determines spectrum resource allocation on the small timescale via sequential quadratic programming. Extensive simulation results demonstrate that the proposed scheme can reduce the fine-tuning delay by up to 80.2% and communication overhead by 93.6% compared to state-of-the-art benchmarks, while satisfying device-side memory and model accuracy constraints.
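The three compression stages named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `compress` and `decompress`, the `keep_ratio` parameter, the fixed 8-bit uniform quantizer, and the use of `zlib` as the lossless encoder are all assumptions made for illustration.

```python
import zlib
import numpy as np

def compress(x, keep_ratio=0.1, rng=None):
    """Illustrative pipeline: Top-K sparsify, stochastically quantize, zlib-encode."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.float32)
    flat = x.ravel()
    k = min(flat.size, max(1, int(round(keep_ratio * flat.size))))

    # Top-K sparsification: keep the k entries with the largest magnitude.
    idx = np.argpartition(np.abs(flat), flat.size - k)[-k:]
    idx = np.sort(idx).astype(np.uint32)
    vals = flat[idx]

    # Stochastic quantization to 256 levels (assumed 8 bits): round up with
    # probability equal to the fractional part, so rounding is unbiased.
    lo, hi = float(vals.min()), float(vals.max())
    levels = 255
    scale = (hi - lo) / levels if hi > lo else 1.0
    pos = (vals - lo) / scale
    floor = np.floor(pos)
    q = floor + (rng.random(k) < (pos - floor))
    q = np.clip(q, 0, levels).astype(np.uint8)

    # Lossless encoding of indices and quantized values (zlib is an assumption).
    payload = zlib.compress(idx.tobytes() + q.tobytes())
    meta = (flat.size, k, lo, scale, x.shape)
    return payload, meta

def decompress(payload, meta):
    """Invert the lossless stage and rebuild a dense tensor with zeros elsewhere."""
    size, k, lo, scale, shape = meta
    raw = zlib.decompress(payload)
    idx = np.frombuffer(raw[:4 * k], dtype=np.uint32)
    q = np.frombuffer(raw[4 * k:], dtype=np.uint8)
    out = np.zeros(size, dtype=np.float32)
    out[idx] = lo + q.astype(np.float32) * scale
    return out.reshape(shape)
```

In this sketch only the sparsification and quantization stages lose information; the error on each retained entry is at most one quantization step, and the entries dropped by Top-K are reconstructed as zero.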

Paper Structure (41 sections, 38 equations, 10 figures, 3 tables, 3 algorithms)

Figures (10)

  • Figure 1: (a) In the SFT framework, devices are fine-tuned in parallel with a shared server-side pre-trained model and multiple LoRAs; (b) All transformer blocks are divided into a server-side part and a device-side part. Each transformer block consists of a multi-head self-attention (MSA) module and an MLP, both of which are composed of pre-trained model weights and a corresponding LoRA.
  • Figure 2: Transmission compression scheme.
  • Figure 3: Fitting results for SFT data processing accuracy.
  • Figure 4: Fine-tuning delay in each round.
  • Figure 5: Fine-tuning performance comparison among different schemes.
  • ...and 5 more figures
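
The block split described in the Figure 1 caption can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's model: `LoRALinear`, `split_blocks`, and `forward` are hypothetical names, each "block" is a single linear layer with a tanh standing in for a full MSA + MLP transformer block, and the device is assumed to hold the first `cut` blocks while the server holds the rest.

```python
import numpy as np

class LoRALinear:
    """Frozen pre-trained weight W plus a trainable low-rank update B @ A."""

    def __init__(self, d_in, d_out, rank, rng):
        # Frozen pre-trained weight (random here, purely for illustration).
        self.W = (rng.standard_normal((d_out, d_in)) / d_in ** 0.5).astype(np.float32)
        # Trainable LoRA factors; B starts at zero so the update is initially a no-op.
        self.A = (0.01 * rng.standard_normal((rank, d_in))).astype(np.float32)
        self.B = np.zeros((d_out, rank), dtype=np.float32)

    def __call__(self, x):
        return x @ self.W.T + (x @ self.A.T) @ self.B.T

def split_blocks(blocks, cut):
    """Assumed split: the first `cut` blocks go to the device, the rest to the server."""
    return blocks[:cut], blocks[cut:]

def forward(blocks, x):
    """Run the input through a list of blocks (tanh stands in for the block nonlinearity)."""
    for block in blocks:
        x = np.tanh(block(x))
    return x
```

Running the device-side blocks and then the server-side blocks on the intermediate activations gives the same output as running the unsplit model; those intermediate activations are the data that would cross the wireless link.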