Selective Aggregation for Low-Rank Adaptation in Federated Learning

Pengxin Guo, Shuang Zeng, Yanran Wang, Huijie Fan, Feifei Wang, Liangqiong Qu

TL;DR

This work reveals a fundamental A-B asymmetry in LoRA when applied to federated learning: A encodes general knowledge shared across clients, while B captures client-specific nuances. Leveraging this insight, the authors propose FedSA-LoRA, which trains both A and B but shares only A for server aggregation, enabling personalized local updates via B. The approach extends to rsLoRA and VeRA, establishing a general paradigm for combining LoRA with FL. Theoretical convergence results and extensive NLP experiments (GLUE and generation tasks) demonstrate improved efficiency and performance, especially under non-IID data, with reduced communication rounds and competitive computation costs.

Abstract

We investigate LoRA in federated learning through the lens of the asymmetry analysis of the learned $A$ and $B$ matrices. In doing so, we uncover that $A$ matrices are responsible for learning general knowledge, while $B$ matrices focus on capturing client-specific knowledge. Based on this finding, we introduce Federated Share-A Low-Rank Adaptation (FedSA-LoRA), which employs two low-rank trainable matrices $A$ and $B$ to model the weight update, but only $A$ matrices are shared with the server for aggregation. Moreover, we delve into the relationship between the learned $A$ and $B$ matrices in other LoRA variants, such as rsLoRA and VeRA, revealing a consistent pattern. Consequently, we extend our FedSA-LoRA method to these LoRA variants, resulting in FedSA-rsLoRA and FedSA-VeRA. In this way, we establish a general paradigm for integrating LoRA with FL, offering guidance for future work on subsequent LoRA variants combined with FL. Extensive experimental results on natural language understanding and generation tasks demonstrate the effectiveness of the proposed method. Our code is available at https://github.com/Pengxin-Guo/FedSA-LoRA.
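
The share-only-$A$ rule described above can be illustrated with a minimal sketch of one server aggregation step. It is not taken from the authors' repository. The names `average_matrices` and `fedsa_aggregate`, the dictionary layout of a client, and the weighting by sample count (FedAvg-style) are all assumptions made for illustration, and plain Python lists are used to keep the example dependency-free.

```python
from typing import Dict, List

Matrix = List[List[float]]


def average_matrices(matrices: List[Matrix], weights: List[float]) -> Matrix:
    """Weighted element-wise average of same-shaped matrices."""
    total = sum(weights)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [
            sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def fedsa_aggregate(clients: List[Dict[str, Matrix]], num_samples: List[int]) -> None:
    """One hypothetical server step: average A across clients, keep every B local."""
    # Assumption: clients are weighted by their local sample counts.
    weights = [float(n) for n in num_samples]
    global_a = average_matrices([c["A"] for c in clients], weights)
    for client in clients:
        # Each client gets its own copy of the averaged A; B is never touched.
        client["A"] = [row[:] for row in global_a]


# Two toy clients with rank-1 adapters (A is 1x2, B is 2x1).
clients = [
    {"A": [[1.0, 2.0]], "B": [[0.5], [0.1]]},
    {"A": [[3.0, 6.0]], "B": [[-0.2], [0.9]]},
]
fedsa_aggregate(clients, num_samples=[1, 1])
```

After the call, both clients hold the same averaged $A$ while their $B$ matrices remain the client-specific values they started with, which is the personalization mechanism the abstract describes.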

Paper Structure

This paper contains 28 sections, 2 theorems, 49 equations, 6 figures, and 10 tables.

Key Result

Lemma 1

Fine-tuning $B$ while fixing $A = Q$, with the goal of optimizing Eq. (eq:lr_loss), yields:

Fine-tuning $A$ while fixing $B = U$ and assuming $U^{-1}$ exists, with the goal of optimizing Eq. (eq:lr_loss), yields:

Figures (6)

  • Figure 1: The illustration of (a) LoRA, (b) FFA-LoRA, and (c) FedSA-LoRA. In LoRA, both $A$ and $B$ matrices are trainable and shared with the server for aggregation. In FFA-LoRA, only $B$ matrices are trainable and shared with the server for aggregation, while $A$ matrices are fixed after initialization. In FedSA-LoRA, both $A$ and $B$ matrices are trainable, but only $A$ matrices are shared with the server for aggregation while $B$ matrices are kept locally.
  • Figure 2: Mean of pairwise cosine similarity of the learned $A$ and $B$ matrices across layers of a RoBERTa model locally fine-tuned with LoRA on the RTE task, with different levels of data heterogeneity. (a)-(c): value matrices; (d)-(f): query matrices. The learned $A$ matrices are more similar across clients than the $B$ matrices, and with increased data heterogeneity, the similarity of $B$ matrices between different clients decreases.
  • Figure 3: Mean of pairwise cosine similarity of the learned $A$ matrices across layers of a RoBERTa model locally fine-tuned with LoRA on the RTE task, with different levels of data heterogeneity. (a)-(c): value matrices; (d)-(f): query matrices. The learned $A$ matrices across clients are similar but not identical.
  • Figure 4: Cosine similarity of learned and initialized $A$ matrices across layers of different clients of a RoBERTa model locally fine-tuned with LoRA on the RTE task. (a)-(c): value matrices; (d)-(f): query matrices. The learned $A$ matrices are different from the initialized $A$ matrices, indicating that the $A$ matrices are updated.
  • Figure 5: Mean of pairwise cosine similarity of the learned $A$ and $B$ matrices across layers of a RoBERTa model locally fine-tuned with rsLoRA on the RTE task, with different levels of data heterogeneity. (a)-(c): value matrices; (d)-(f): query matrices. The learned $A$ matrices are more similar across clients than the $B$ matrices, and with increased data heterogeneity, the similarity of $B$ matrices between different clients decreases.
  • ...and 1 more figure
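
The "mean of pairwise cosine similarity" reported in the captions above can be sketched as follows. The captions do not say how two matrices are compared, so this example assumes each matrix is flattened into a vector before the cosine is taken. The function names and the toy inputs are hypothetical.

```python
import math
from itertools import combinations
from typing import List

Matrix = List[List[float]]


def cosine_similarity(x: Matrix, y: Matrix) -> float:
    """Cosine similarity of two same-shaped matrices, treated as flat vectors."""
    u = [v for row in x for v in row]
    w = [v for row in y for v in row]
    dot = sum(a * b for a, b in zip(u, w))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in w))
    if norm == 0.0:
        # An all-zero matrix (e.g. an untrained B) has no defined direction.
        raise ValueError("cosine similarity is undefined for an all-zero matrix")
    return dot / norm


def mean_pairwise_cosine(matrices: List[Matrix]) -> float:
    """Average cosine similarity over all unordered pairs of client matrices."""
    pairs = list(combinations(matrices, 2))
    if not pairs:
        raise ValueError("need at least two matrices")
    return sum(cosine_similarity(x, y) for x, y in pairs) / len(pairs)


# Toy input: the same layer's matrix from three clients.
client_matrices = [[[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]]]
score = mean_pairwise_cosine(client_matrices)
```

Here two of the three matrices are identical and the third is orthogonal to both, so one pair scores 1 and two pairs score 0. Under this reading, a higher score for the $A$ matrices than for the $B$ matrices corresponds to the pattern the captions describe.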

Theorems & Definitions (5)

  • Lemma 1
  • Remark 1
  • Theorem 1
  • proof
  • proof