Save It All: Enabling Full Parameter Tuning for Federated Large Language Models via Cycle Block Gradient Descent

Lin Wang, Zhichao Wang, Xiaoying Tang

TL;DR

This paper tackles the high resource barrier for fully tuning large language models in federated learning by introducing FedCyBGD, a Cycle Block Gradient Descent framework that updates model blocks cyclically across clients and exchanges only compressed, block-level updates. It combines a block-wise training schedule with a hybrid compression scheme (layer dropping and pruning) to enable full-parameter tuning on edge devices while drastically reducing download, upload, memory, and computation costs. The method is shown to achieve performance competitive with or superior to centralized full tuning across diverse LLMs and NLP tasks, with substantial resource savings and compatibility with PEFT approaches. The proposed approach thus offers a practical, privacy-preserving path to scalable FL for LLMs in real-world, resource-limited environments.

Abstract

The advent of large language models (LLMs) has revolutionized the deep learning paradigm, yielding impressive results across a wide array of tasks. However, the pre-training or fine-tuning of LLMs within a federated learning (FL) framework poses substantial challenges, including considerable computational and memory resource demands, as well as communication bottlenecks between servers and clients. Existing solutions either make the unrealistic assumption that the entire model is exchanged for training, or apply parameter-efficient fine-tuning methods from centralized learning to train LLMs in FL; the latter tend to underperform during training or fine-tuning because parameter updates are confined to a limited search subspace. In this paper, we introduce a novel method for the efficient training and fine-tuning of LLMs in FL, with minimal resource consumption. Our approach, termed FedCyBGD, utilizes Cycle Block Gradient Descent to periodically update the model. In particular, we design a compression scheme for FedCyBGD, aiming to further decrease the model download cost. It enables full parameter training in FL with only selected block updates and uploads, thereby reducing communication, computation, and memory costs. Our method achieves state-of-the-art performance for FL LLM training, while significantly reducing associated costs. Code is provided here.
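To make the cyclic block update concrete, the following is a minimal sketch on a toy least-squares problem, not the authors' implementation. It assumes that clients take turns in round-robin order, that client `k` is responsible for block `k % len(blocks)`, and that the download-side compression is omitted entirely. All function names, block assignments, and hyperparameters here are hypothetical.

```python
import random

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def local_block_step(w, block, data, lr, steps):
    """Gradient descent on the coordinates in `block`; all others stay frozen."""
    w = list(w)  # local copy of the downloaded model
    for _ in range(steps):
        grad = [0.0] * len(block)
        for x, y in data:
            err = predict(w, x) - y
            for g, idx in enumerate(block):
                grad[g] += 2.0 * err * x[idx] / len(data)
        for g, idx in enumerate(block):
            w[idx] -= lr * grad[g]
    return [w[idx] for idx in block]  # only the tuned block is uploaded

def fed_cyclic_block_gd(w, blocks, client_data, rounds, lr=0.05, steps=5):
    """Server loop: one client per round, each tuning only its own block."""
    w = list(w)
    for t in range(rounds):
        k = t % len(client_data)         # assumption: round-robin client order
        block = blocks[k % len(blocks)]  # assumption: fixed client-to-block map
        update = local_block_step(w, block, client_data[k], lr, steps)
        for idx, value in zip(block, update):
            w[idx] = value               # server integrates the returned block
    return w

# Toy setup (hypothetical): 4 clients, 4 parameters split into 2 blocks.
rng = random.Random(0)
w_true = [1.0, -2.0, 3.0, 0.5]
client_data = []
for _ in range(4):
    samples = []
    for _ in range(50):
        x = [rng.gauss(0.0, 1.0) for _ in w_true]
        samples.append((x, predict(w_true, x)))
    client_data.append(samples)

all_data = [s for samples in client_data for s in samples]
w0 = [0.0] * len(w_true)
blocks = [[0, 1], [2, 3]]

loss_before = mse(w0, all_data)
w_final = fed_cyclic_block_gd(w0, blocks, client_data, rounds=80)
loss_after = mse(w_final, all_data)
print(loss_before, loss_after)
```

Although each client only ever touches one block, every parameter is eventually updated as the cycle passes over all blocks, which is the sense in which the scheme amounts to full-parameter tuning.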

Paper Structure (29 sections, 7 equations, 4 figures, 10 tables, 2 algorithms)

Figures (4)

  • Figure 1: Observation on Federated Learning. Bar graphs represent the estimated memory usage for full parameter tuning of a LLaMA-7B model on a single device, and the line graph represents the loss across different training paradigms. 'Centralized-Cy' denotes centralized training with cyclical block updates, 'Fed-full' refers to federated full parameter tuning with complete model communication to clients, 'FedBAvg' signifies federated training with block updates where the server selects clients for tuning and aggregates updates, and 'FedCyBGD' represents our approach, where clients cyclically participate in block tuning.
  • Figure 2: Overview of FedCyBGD training. In FedCyBGD, the server sends the client the block it is responsible for (Block 1) together with compressed versions of the remaining blocks (Others). The client fine-tunes its responsible block on local data while the compressed remainder of the model stays frozen. The tuned block is returned to the server for integration. Because only compressed parameters are downloaded and only the tuned block is uploaded, the number of communicated parameters is kept small. By updating only the responsible block on each client, computational and memory costs are reduced, enabling full-parameter tuning for FL on resource-limited edge devices.
  • Figure 3: The performance of FedCyBGD under different block allocation strategies.
  • Figure 4: Overall convergence performance of FedCyBGD, from the start to the end of training.
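
The communication pattern described in the Figure 2 caption can be illustrated with simple parameter-count arithmetic. The sketch below assumes that compression keeps a fixed fraction `keep_ratio` of the parameters in the blocks a client is not responsible for. The helper `round_costs`, the block sizes, and the ratio are hypothetical values chosen for illustration, not figures from the paper.

```python
def round_costs(block_sizes, responsible, keep_ratio):
    """Return (download, upload) parameter counts for one client in one round.

    Assumption: the responsible block is sent uncompressed, and compression
    keeps `keep_ratio` of the parameters in every other block.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must lie in [0, 1]")
    own = block_sizes[responsible]
    others = sum(block_sizes) - own
    download = own + round(others * keep_ratio)  # full block + compressed rest
    upload = own                                 # only the tuned block returns
    return download, upload

# Hypothetical model: 32 equally sized blocks of 200M parameters each.
block_sizes = [200_000_000] * 32
full_model = sum(block_sizes)

download, upload = round_costs(block_sizes, responsible=0, keep_ratio=0.25)
print(f"download: {download / full_model:.1%} of the full model")
print(f"upload:   {upload / full_model:.1%} of the full model")
```

Under these assumed numbers the upload shrinks to a single block, while the download cost depends on how aggressively the remaining blocks are compressed.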

Theorems & Definitions (1)

  • Definition 3.1