Stabilized Fine-Tuning with LoRA in Federated Learning: Mitigating the Side Effect of Client Size and Rank via the Scaling Factor

Jiayu Huang, Xiaohu Wu, Tiantian He, Qicheng Lao

TL;DR

This paper introduces Stabilized Federated LoRA (SFed-LoRA), a framework that theoretically characterizes the interaction between adapter rank and federated aggregation and derives an optimal scaling factor that mitigates the aggregation error accumulating across N clients.

Abstract

Large Language Models (LLMs) are pivotal in natural language processing. The impracticality of full fine-tuning has prompted Parameter-Efficient Fine-Tuning (PEFT) methods such as Low-Rank Adaptation (LoRA), which optimizes low-rank matrices A and B. In distributed scenarios where privacy constraints necessitate Federated Learning (FL), however, integrating LoRA is often unstable. Specifically, we identify that aggregating updates from multiple clients introduces statistical variance that scales with the client count, causing gradient collapse when high-rank adapters are used. Existing scaling factor candidates, such as the one used by Rank-Stabilized LoRA, ignore the interaction caused by the aggregation process. To bridge this gap, this paper introduces Stabilized Federated LoRA (SFed-LoRA), a framework that theoretically characterizes the interaction between adapter rank and federated aggregation. We derive an optimal scaling factor designed to mitigate the aggregation error accumulating across N clients. By correcting the scaling mismatch inherent in previous approaches, SFed-LoRA restores the efficacy of high-rank adaptation without altering the original model architecture or increasing inference latency. Extensive experiments across diverse tasks, model architectures, and heterogeneous data distributions validate our results. We demonstrate that SFed-LoRA prevents high-rank collapse and achieves significantly improved stability and faster convergence than state-of-the-art baselines for high-rank adaptation.

Paper Structure (40 sections, 2 theorems, 28 equations, 9 figures, 2 tables)

Key Result

Theorem 4.2

Consider a federated learning ecosystem with $N$ clients and LoRA adapters scaled by $\gamma_z \in \mathbb{R}$, where $z=(N, r)$. Let the rank $r \to \infty$ and $\gamma_z \to 0$. The adapters are $(N, r)$-federated-stabilized (per Definition 4.1) if and only if $\gamma_z \propto \sqrt{N/r}$, the scaling that appears in Figure 1 as $\gamma_z = \alpha\sqrt{N/r}$. In particular, unless $\gamma_z$ scales in this way, the learning process exhibits instability.
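To make the dependence on $N$ and $r$ concrete, the following minimal sketch compares three scaling rules: the standard LoRA factor $\alpha/r$, the Rank-Stabilized LoRA factor $\alpha/\sqrt{r}$, and the factor $\alpha\sqrt{N/r}$ quoted in the Figure 1 caption. The function names and the values $\alpha = 16$ and $N = 10$ are illustrative assumptions, not taken from the paper.

```python
import math


def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor: alpha / r."""
    return alpha / r


def rslora_scale(alpha: float, r: int) -> float:
    """Rank-Stabilized LoRA scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(r)


def sfed_lora_scale(alpha: float, r: int, n_clients: int) -> float:
    """Scaling factor from the Figure 1 caption: alpha * sqrt(N / r)."""
    return alpha * math.sqrt(n_clients / r)


if __name__ == "__main__":
    alpha, n_clients = 16.0, 10  # hypothetical values chosen for illustration
    for r in (4, 8, 32, 128, 512):  # ranks used in the paper's figures
        print(
            f"r={r:4d}  LoRA={lora_scale(alpha, r):.4f}  "
            f"rsLoRA={rslora_scale(alpha, r):.4f}  "
            f"SFed-LoRA={sfed_lora_scale(alpha, r, n_clients):.4f}"
        )
```

Under these formulas the federated factor equals the Rank-Stabilized LoRA factor multiplied by $\sqrt{N}$, so the two coincide when $N = 1$.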

Figures (9)

  • Figure 1: The framework of SFed-LoRA. It adopts a split aggregation strategy where clients upload only matrix $A$ while maintaining $B$ locally to protect privacy. A novel scaling factor $\gamma_z = \alpha\sqrt{N/r}$ is integrated into the local computation (as shown in the equation) to counteract aggregation interference, ensuring stable training performance across varying client numbers.
  • Figure 2: Convergence of Perplexity (PPL) on the Alpaca dataset using the LLaMA2-7B model under an IID federated learning setting. The subplots compare the training trajectories of four methods across ranks $r \in \{4, 8, 32, 128, 512\}$: (a) RoLoRA (purple), (b) FedSA-LoRA (copper), (c) FedSA-rsLoRA (blue), and (d) SFed-LoRA (green). Darker curves correspond to higher ranks. The figure displays the evolution of validation perplexity over 100 communication rounds.
  • Figure 3: Evolution of average parameter gradient norms on the Alpaca dataset (IID). The subplots display the training trajectories for (a) RoLoRA (purple), (b) FedSA-LoRA (copper), (c) FedSA-rsLoRA (blue), and (d) SFed-LoRA (green) across ranks $r \in \{4, 8, 32, 128, 512\}$. Darker curves correspond to higher ranks.
  • Figure 4: Comparative analysis of perplexity with a fixed rank $r = 512$ using the LLaMA2-7B-hf model on the Alpaca dataset in an IID federated learning setting. The subplots correspond to varying client counts: (a) $N=5$, (b) $N=10$, (c) $N=15$, and (d) $N=20$. Within each plot, the curves represent RoLoRA (purple), FedSA-LoRA (orange), FedSA-rsLoRA (blue), and SFed-LoRA (green).
  • Figure 5: Evolution of average parameter gradient norms on the GSM8K dataset using the LLaMA2-7B model. The subplots display the training trajectories for (a) RoLoRA, (b) FedSA-LoRA, (c) FedSA-rsLoRA, and (d) SFed-LoRA across ranks $r \in \{4, 8, 32, 128, 512\}$. Darker curves correspond to higher ranks.
  • ...and 4 more figures
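The split aggregation described in the Figure 1 caption can be sketched roughly as follows: each client uploads only its matrix $A$, the server aggregates these, and each client keeps its own $B$ and applies $\gamma_z = \alpha\sqrt{N/r}$ in the local adapter computation. This is a sketch under stated assumptions, not the authors' implementation: the unweighted mean over clients, the pure-Python matrix helpers, and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def average_matrices(mats: List[Matrix]) -> Matrix:
    """Element-wise unweighted mean of same-shaped matrices (assumed aggregation rule)."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)] for i in range(rows)]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product x @ y."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def adapter_delta(b: Matrix, a: Matrix, alpha: float, n_clients: int) -> Matrix:
    """Local weight update gamma_z * B @ A with gamma_z = alpha * sqrt(N / r)."""
    r = len(a)  # A has shape (r, d_in), so its row count is the rank
    gamma = alpha * math.sqrt(n_clients / r)
    return [[gamma * v for v in row] for row in matmul(b, a)]


def aggregate_round(client_as: List[Matrix]) -> List[Matrix]:
    """Server step: average the uploaded A matrices and send a copy to every client.

    The B matrices never leave the clients, so they do not appear here.
    """
    global_a = average_matrices(client_as)
    return [[row[:] for row in global_a] for _ in client_as]
```

For example, with two clients holding $A_1 = [1, 2]$ and $A_2 = [3, 4]$ (rank 1), `aggregate_round` returns $[2, 3]$ to both clients, and each client then combines it with its own local $B$ through `adapter_delta`.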

Theorems & Definitions (5)

  • Definition 4.1: $(N, r)$-Federated-Stabilized Adapter
  • Theorem 4.2: Optimal Federated Scaling Factor
  • Definition 1.1
  • Theorem 1.2
  • proof