Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA

Shuangyi Chen, Yuanxin Guo, Yue Ju, Harik Dalal, Zhongwen Zhu, Ashish Khisti

TL;DR

RoLoRA tackles inexact model updates in federated LoRA fine-tuning by introducing alternating optimization over the LoRA down- and up-projection matrices, enabling robust, expressive adapters under communication constraints. Theoretical analysis on a linear regressor demonstrates exponential convergence to the global optimum, while non-linear experiments and non-convex convergence guarantees extend these insights to practical models. Empirical results on RoBERTa-Large and Llama-2-7B across GLUE, commonsense reasoning, and generation tasks show RoLoRA consistently outperforms FedAVG-LoRA, FFA-LoRA, and FlexLoRA, especially as the number of clients grows or finetuning budgets shrink. The method halves communication compared to full LoRA baselines and scales to large FL settings, offering a principled, scalable approach to robust federated fine-tuning of large language models.

Abstract

Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) optimize federated training by reducing computational and communication costs. We propose RoLoRA, a federated framework using alternating optimization to fine-tune LoRA adapters. Our approach emphasizes the importance of learning both the up- and down-projection matrices to enhance expressiveness and robustness. We use both theoretical analysis and extensive experiments to demonstrate the advantages of RoLoRA over prior approaches that either generate imperfect model updates or limit the expressiveness of the model. We provide a theoretical analysis on a linear model to highlight the importance of learning both the down-projection and up-projection matrices in LoRA. We validate the insights on a non-linear model and separately provide a convergence proof under general conditions. To bridge theory and practice, we conduct extensive experimental evaluations on language models, including RoBERTa-Large and Llama-2-7B, across diverse tasks and FL settings to demonstrate the advantages of RoLoRA over other methods.
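The alternating scheme described in the abstract can be illustrated on a toy linear regression problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the row-vector convention (adapted weight `W0 + A @ B`, with `A` as the down-projection and `B` as the up-projection), the learning rate, the number of local steps, and the synthetic data are all hypothetical choices made for illustration. In each round, clients train only one factor while the other stays frozen and shared, and the server averages the trained factor.

```python
import numpy as np

def global_loss(clients, W0, A, B):
    """Average least-squares loss over clients for the adapted weight W0 + A @ B."""
    return float(np.mean([0.5 * np.mean(np.sum((X @ (W0 + A @ B) - Y) ** 2, axis=1))
                          for X, Y in clients]))

def local_update(X, Y, W0, A, B, train_up, lr=0.05, steps=5):
    """Run a few local gradient steps, updating only one LoRA factor (hypothetical helper)."""
    A, B = A.copy(), B.copy()
    n = X.shape[0]
    for _ in range(steps):
        resid = X @ (W0 + A @ B) - Y      # residuals, shape (n, k)
        grad_W = X.T @ resid / n          # gradient w.r.t. the full weight, shape (d, k)
        if train_up:
            B -= lr * (A.T @ grad_W)      # up-projection step, A frozen
        else:
            A -= lr * (grad_W @ B.T)      # down-projection step, B frozen
    return A, B

def alternating_fed_lora(clients, W0, rank, rounds=100, seed=0):
    """Alternate which factor is trained and averaged in each communication round."""
    rng = np.random.default_rng(seed)
    d, k = W0.shape
    A = rng.normal(scale=0.1, size=(d, rank))  # down-projection: small random init
    B = np.zeros((rank, k))                    # up-projection: zero init
    for t in range(rounds):
        train_up = (t % 2 == 0)
        results = [local_update(X, Y, W0, A, B, train_up) for X, Y in clients]
        if train_up:
            B = np.mean([b for _, b in results], axis=0)  # only B is sent and averaged
        else:
            A = np.mean([a for a, _ in results], axis=0)  # only A is sent and averaged
    return A, B

def make_toy_clients(num_clients=3, n=50, d=6, k=3, rank=2, seed=1):
    """Synthetic noiseless data whose target differs from W0 by a low-rank term."""
    rng = np.random.default_rng(seed)
    W0 = rng.normal(size=(d, k))
    delta = (rng.normal(size=(d, rank)) / np.sqrt(d)) @ (rng.normal(size=(rank, k)) / np.sqrt(rank))
    clients = []
    for _ in range(num_clients):
        X = rng.normal(size=(n, d))
        clients.append((X, X @ (W0 + delta)))
    return clients, W0

if __name__ == "__main__":
    clients, W0 = make_toy_clients()
    A, B = alternating_fed_lora(clients, W0, rank=2)
    print("final loss:", global_loss(clients, W0, A, B))
```

Because the frozen factor is identical on every client within a round, averaging the trained factor is the same as averaging the full low-rank products, so the aggregated update in this sketch is exact rather than approximate. Only one factor is exchanged per round, which is consistent with the TL;DR's statement that communication is halved relative to sending both factors.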

Paper Structure

This paper contains 80 sections, 16 theorems, 149 equations, 13 figures, 20 tables, 2 algorithms.

Key Result

Lemma 4.3

Let $\delta^t = \lVert (\mathbf{I}_d-\mathbf{a}^*\mathbf{a}^{*\top})\mathbf{a}^t\rVert$ be the angle distance between $\mathbf{a}^{*}$ and $\mathbf{a}^{t}$ at the $t$-th iteration. Assume that Assumption client-norm-1 holds and $\delta^t \leq \delta^{t-1} \leq \dots \leq \delta^0$. Let $m$ be the number [...] with probability at least $1-2q^{-10}$.
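The angle distance in the lemma can be computed directly from its definition. The sketch below is illustrative only and assumes both vectors are normalized to unit length before the projection is applied (the function name `angle_distance` is hypothetical). Under that assumption, the quantity equals the sine of the angle between $\mathbf{a}^*$ and $\mathbf{a}^t$.

```python
import numpy as np

def angle_distance(a_star, a):
    """Norm of the component of a orthogonal to a_star, i.e. ||(I - a* a*^T) a||.

    Assumption: both vectors are rescaled to unit length first.
    """
    a_star = np.asarray(a_star, dtype=float)
    a = np.asarray(a, dtype=float)
    a_star = a_star / np.linalg.norm(a_star)
    a = a / np.linalg.norm(a)
    # Subtract the projection of a onto a_star instead of forming the d x d matrix.
    residual = a - a_star * (a_star @ a)
    return float(np.linalg.norm(residual))
```

The distance is 0 when the two vectors are aligned (up to sign) and 1 when they are orthogonal, so a decreasing sequence $\delta^t$ means the iterate's direction is approaching that of $\mathbf{a}^*$.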

Figures (13)

  • Figure 1: (Left) Overview of the RoLoRA framework. (Right) Performance comparison with baselines on QQP in a 50-client setting, showing RoLoRA’s superior convergence speed and final accuracy.
  • Figure 2: (Left) Comparison of three methods on a toy model with 5 clients. (Right) Comparison of three methods on a toy model with 10 clients.
  • Figure 3: Accuracies over rounds with RoBERTa-Large models on SST-2, QNLI, MNLI, and QQP. It involves 50 clients using rank 4.
  • Figure 3: Results with Llama-2-7B models on commonsense reasoning tasks. This involves 50 clients using rank 8.
  • Figure 4: Results with RoBERTa-Large models on GLUE under different fine-tuning parameter budgets, involving three clients with rank 4.
  • ...and 8 more figures

Theorems & Definitions (37)

  • Definition 4.2
  • Lemma 4.3
  • Remark 4.4
  • Theorem 4.5
  • Proposition 4.6
  • Remark 4.7
  • Definition A3.1: Sub-Gaussian Norm
  • Definition A3.2: Sub-Exponential Norm
  • Lemma A3.3: The product of sub-Gaussians is sub-exponential
  • Lemma A3.4: Sum of independent sub-Gaussians
  • ...and 27 more