Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes

Zhen Qin; Daoyuan Chen; Bingchen Qian; Bolin Ding; Yaliang Li; Shuiguang Deng

TL;DR

This paper tackles the challenge of federated full-parameter tuning for billion-sized LLMs by introducing FedKSeed, a zeroth-order FL approach that communicates only a fixed set of random seeds and scalar gradients, achieving per-round transmission under 18 KB. Building on this, FedKSeed-Pro adds non-uniform seed sampling to prioritize perturbations with larger impact on accuracy, further reducing seed requirements while improving performance. The method maintains convergence comparable to FedZO under standard assumptions and demonstrates average Rouge-L gains of about 7.26% over practical baselines, alongside roughly a thousand-fold reduction in communication. The results show FedKSeed and FedKSeed-Pro enable practical full-parameter federated tuning on devices, with strong robustness across model sizes, datasets, and federated settings, opening pathways for decentralized, privacy-preserving large-model adaptation.
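The idea that a zeroth-order step can be described by a random seed plus one scalar can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the two-point gradient estimate, the quadratic `toy_loss`, the step size, and all function names are hypothetical choices made for the example.

```python
import numpy as np

def perturbation(seed: int, dim: int) -> np.ndarray:
    """Regenerate the same Gaussian perturbation from a seed (assumed distribution)."""
    return np.random.default_rng(seed).standard_normal(dim)

def scalar_gradient(loss_fn, w: np.ndarray, seed: int, eps: float = 1e-3) -> float:
    """Two-point zeroth-order estimate of the directional derivative along z."""
    z = perturbation(seed, w.size)
    return (loss_fn(w + eps * z) - loss_fn(w - eps * z)) / (2 * eps)

def apply_update(w: np.ndarray, seed: int, g: float, lr: float) -> np.ndarray:
    """One update step, fully determined by (seed, scalar gradient)."""
    return w - lr * g * perturbation(seed, w.size)

def toy_loss(w: np.ndarray) -> float:
    """Hypothetical stand-in for a model loss."""
    return float(np.sum((w - 1.0) ** 2))

LR = 0.01
w0 = np.zeros(8)

# "Client" side: train locally and record only (seed, scalar gradient) pairs.
w_client = w0.copy()
history = []
for seed in [3, 7, 11, 7, 3]:
    g = scalar_gradient(toy_loss, w_client, seed)
    w_client = apply_update(w_client, seed, g, LR)
    history.append((seed, g))

# "Receiver" side: rebuild the same parameters without ever seeing w_client.
w_replayed = w0.copy()
for seed, g in history:
    w_replayed = apply_update(w_replayed, seed, g, LR)
```

Only the `history` list would need to be transmitted in this sketch, and its size does not depend on the number of model parameters.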

Abstract

Pre-trained large language models (LLMs) need fine-tuning to improve their responsiveness to natural language instructions. Federated learning offers a way to fine-tune LLMs using the abundant data on end devices without compromising data privacy. Most existing federated fine-tuning methods for LLMs rely on parameter-efficient fine-tuning techniques, which may not reach the performance achievable with full-parameter tuning. However, federated full-parameter tuning of LLMs is a non-trivial problem because of its immense communication cost. This work introduces FedKSeed, which employs zeroth-order optimization with a finite set of random seeds. It reduces the transmission between the server and clients to just a few random seeds and scalar gradients, amounting to only a few thousand bytes, which makes federated full-parameter tuning of billion-sized LLMs possible on devices. Building on it, we develop a strategy for probability-differentiated seed sampling that prioritizes perturbations with greater impact on model accuracy. Experiments across six scenarios with various LLMs, datasets and data partitions demonstrate that our approach outperforms existing federated LLM fine-tuning methods in both communication efficiency and new task generalization.
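Probability-differentiated seed sampling can be sketched as follows. The abstract only states that perturbations with greater impact are prioritized; the scoring rule used here (a softmax over each seed's mean absolute scalar gradient) and every name in the block are assumptions made for illustration, not a description of the paper's exact formula.

```python
import numpy as np

def seed_probabilities(grad_history, num_seeds: int) -> np.ndarray:
    """Turn per-seed scalar-gradient amplitudes into sampling probabilities.

    grad_history: iterable of (seed_index, scalar_gradient) pairs.
    Assumed rule: softmax over the mean |scalar gradient| of each seed.
    """
    sums = np.zeros(num_seeds)
    counts = np.zeros(num_seeds)
    for k, g in grad_history:
        sums[k] += abs(g)
        counts[k] += 1
    # Seeds that were never used get amplitude 0 instead of a division by zero.
    mean_amp = np.divide(sums, counts, out=np.zeros(num_seeds), where=counts > 0)
    exp = np.exp(mean_amp - mean_amp.max())  # shift for numerical stability
    return exp / exp.sum()

# Hypothetical history: seed 1 produced the largest scalar gradients.
example_history = [(0, 0.1), (1, -2.0), (2, 0.5), (1, 1.5), (3, -0.2)]
probs = seed_probabilities(example_history, num_seeds=4)

# Draw seed indices for the next local steps according to those probabilities.
rng = np.random.default_rng(0)
sampled = rng.choice(4, size=8, p=probs)
```

In this sketch every seed keeps a non-zero probability, so low-amplitude perturbations are sampled less often rather than excluded.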

Paper Structure

This paper contains 42 sections, 3 theorems, 37 equations, 14 figures, 7 tables, and 1 algorithm.

Introduction
Related Work
Problem Formulation
The proposed FedKSeed
Overview
Federated Full-Parameter Tuning by Limited Seeds
Theoretical Support for Seed Reuse
Selection of $K$
Sampling Seeds with Non-uniform Probabilities
Experiments
Experimental Setup
Comparisons on Accuracy Performance
Comparisons on Overheads
Hyper-parameter Sensitivity
Comparisons in Various Federated Scenarios
...and 27 more sections

Key Result

Theorem 1

(Convergence of FedZO.) With the assumptions made by Fang et al. (2022) on (1) the loss boundary, (2) $L$-smoothness of the objective and loss functions, (3) the second-order gradient moment boundary, and (4) gradient dissimilarity, [...] where $\tau$ is the average number of local iterations within one round for each client, and $T$ is [...]

Figures (14)

Figure 1: Each step of ZOO can be replicated by 1) a random seed that is used to generate a perturbation, and 2) a scalar gradient on it.
Figure 2: With more total steps, the time required to compute the latest global model by update replication grows rapidly (calculated with LLaMA-3B).
Figure 3: Overview of FedKSeed, where the serial numbers indicate processes in each round. Gray components share identical values among all clients. The underlined components are only required by an enhanced version of it, i.e., FedKSeed-Pro (Section \ref{subsec-approach-bias}).
Figure 4: Full-parameter tuning convergence of LLaMA-3B on Natural Instructions by FedZO ($b_1 = b_2 = 1$) and FedMeZO.
Figure 5: The mean (absolute value) of the $\Gamma$ perturbations randomly sampled with the $K$ candidate random seeds.
...and 9 more figures
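Figures 1 and 2 suggest why restricting perturbations to a fixed set of candidate seeds helps: replaying every past step grows with the number of steps, whereas scalar gradients that share a seed can be summed first. The sketch below illustrates that equivalence under simplifying assumptions (plain SGD-style updates with a constant learning rate); the candidate seed values, `K`, and the function names are hypothetical.

```python
import numpy as np

K = 4                                   # number of candidate seeds (hypothetical)
DIM = 6                                 # toy parameter count
LR = 0.1                                # constant learning rate (assumed)
candidate_seeds = [101, 202, 303, 404]  # hypothetical seed values

def perturbation(seed: int, dim: int = DIM) -> np.ndarray:
    """Regenerate the perturbation that belongs to a seed."""
    return np.random.default_rng(seed).standard_normal(dim)

# Fake training history: (candidate index, scalar gradient) for 50 steps.
rng = np.random.default_rng(0)
history = [(int(rng.integers(K)), float(rng.standard_normal())) for _ in range(50)]

def replay_all_steps(w0: np.ndarray, steps) -> np.ndarray:
    """Replicate every step one by one: cost grows with len(steps)."""
    w = w0.copy()
    for k, g in steps:
        w -= LR * g * perturbation(candidate_seeds[k])
    return w

def replay_from_accumulator(w0: np.ndarray, steps) -> np.ndarray:
    """Sum scalar gradients per seed, then apply only K updates."""
    acc = np.zeros(K)
    for k, g in steps:
        acc[k] += g
    w = w0.copy()
    for k in range(K):
        w -= LR * acc[k] * perturbation(candidate_seeds[k])
    return w

w_init = np.ones(DIM)
w_slow = replay_all_steps(w_init, history)
w_fast = replay_from_accumulator(w_init, history)
```

Both routes reach the same parameters up to floating-point rounding, because each update is linear in its perturbation once the scalar gradient is fixed. The second route regenerates only `K` perturbations regardless of how many steps were taken.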

Theorems & Definitions (4)

Definition 1
Theorem 1
Lemma 1
Theorem 2
