Personalized Collaborative Fine-Tuning for On-Device Large Language Models

Nicolas Wagner, Dongyang Fan, Martin Jaggi

TL;DR

This work tackles on-device fine-tuning of large language models under data scarcity and privacy constraints by framing collaboration as a trust-weighted, decentralized learning problem. It introduces three aggregation schemes (weight similarity-based, validation performance-based, and prediction similarity-based) built atop Low-Rank Adaptation (LoRA) to minimize communication. Empirical results across diverse datasets show that prediction-based trust (and, to a lesser extent, validation-based trust) yields the best personalization performance, often surpassing FedAvg and local fine-tuning, especially under high data heterogeneity. The study demonstrates practical, communication-efficient, and privacy-preserving approaches for personalized LLM deployment, with insights into topology choices and trust behavior that guide future work in decentralized, privacy-aware NLP systems.

Abstract

We explore on-device self-supervised collaborative fine-tuning of large language models with limited local data availability. Taking inspiration from the collaborative learning community, we introduce three distinct trust-weighted gradient aggregation schemes: weight similarity-based, prediction similarity-based and validation performance-based. To minimize communication overhead, we integrate Low-Rank Adaptation (LoRA) and only exchange LoRA weight updates. Our protocols, driven by prediction and performance metrics, surpass both FedAvg and local fine-tuning methods, which is particularly evident in realistic scenarios with more diverse local data distributions. The results underscore the effectiveness of our approach in addressing heterogeneity and scarcity within local datasets.
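
As a rough illustration of what trust-weighted aggregation of LoRA updates could look like, the sketch below builds a prediction similarity-based trust matrix and uses it to mix client updates. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`prediction_trust`, `aggregate`), the use of negative mean squared difference as the similarity score, and the row-wise softmax normalization are all hypothetical choices made for this example. LoRA weights and updates are represented as flat lists of floats.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_trust(preds):
    """Build a row-stochastic trust matrix from client predictions.

    preds[i] is client i's flattened output on a shared input set.
    ASSUMPTION: similarity is the negative mean squared difference,
    normalized per row with a softmax (illustrative choice only).
    """
    n = len(preds)
    trust = []
    for i in range(n):
        scores = []
        for j in range(n):
            dist = sum((a - b) ** 2 for a, b in zip(preds[i], preds[j]))
            scores.append(-dist / len(preds[i]))
        trust.append(softmax(scores))
    return trust


def aggregate(thetas, deltas, trust):
    """Apply theta_i <- theta_i + sum_j trust[i][j] * delta_j per client."""
    n = len(thetas)
    dim = len(thetas[0])
    updated = []
    for i in range(n):
        mixed = [sum(trust[i][j] * deltas[j][k] for j in range(n))
                 for k in range(dim)]
        updated.append([t + m for t, m in zip(thetas[i], mixed)])
    return updated


# Toy example: clients 0 and 1 predict similarly, client 2 differs.
preds = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
thetas = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
deltas = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

trust = prediction_trust(preds)
new_thetas = aggregate(thetas, deltas, trust)
```

In this toy setup, client 0 places more trust in client 1 than in client 2, so its aggregated update leans toward the direction shared by the two similar clients. Setting every row of the trust matrix to uniform weights would reduce the mixing step to a plain average of updates, which is the behavior the trust-based schemes are meant to improve on when local data is heterogeneous.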

Paper Structure (31 sections, 4 equations, 10 figures, 9 tables, 1 algorithm)

Figures (10)

  • Figure 1: Diagram of our protocol. ${\bm{\theta}}_i$ and $\Delta {\bm{\theta}}_i$ denote the LoRA weights and LoRA weight updates, respectively. $s_i$ denotes the message sent alongside $\Delta {\bm{\theta}}_i$, which is either ${\bm{\theta}}_i$ or $f_{{\bm{\theta}}_i}({\bm{X}}_S)$ depending on the protocol (see Table \ref{tab: communication_vs_computation}). $g(\cdot)$ is our proposed trust calculation, detailed in Section \ref{trust-calculation}.
  • Figure 2: Ablation study of whether to add LoRA modules compared to full fine-tuning. Left bars correspond to Ratio, which denotes the decrease in test perplexity compared to the pre-trained model per trainable parameter (higher is better); right bars correspond to perplexity, which denotes test performance after fine-tuning (lower is better).
  • Figure 3: L1 norm of differences between ${\bm{W}}$ at two consecutive steps for Strategy 2 and Strategy 3.
  • Figure 4: Training time required for a ring topology to reach the same perplexity as a fully connected topology, expressed as a multiple $X$ of the training iterations needed in the fully connected case. NA indicates that the same perplexity was not reached within ten times the training iterations.
  • Figure 4: Oracle trust matrix versus learned trust matrix using Strategy 1 when users are assigned Multilingual Wikipedia datasets. The diagonal entries are masked out, and trust is measured at the end of training.
  • ...and 5 more figures