FedPDPO: Federated Personalized Direct Preference Optimization for Large Language Model Alignment

Kewen Zhu, Liping Yi, Zhiming Zhao, Zhuang Qi, Han Yu, Qinghua Hu

Abstract

Aligning large language models (LLMs) with human preferences in federated learning (FL) is challenging because preference data are decentralized, privacy-sensitive, and highly non-IID. Direct Preference Optimization (DPO) offers an efficient alternative to reinforcement learning from human feedback (RLHF), but applying it directly in FL leads to severe performance degradation under non-IID data and to limited generalization of its implicit rewards. To bridge this gap, we propose FedPDPO (Federated Personalized Direct Preference Optimization), a personalized federated framework for preference alignment of LLMs. It adopts a parameter-efficient fine-tuning architecture in which each client maintains a frozen pretrained LLM backbone augmented with a Low-Rank Adaptation (LoRA) adapter, enabling communication-efficient aggregation. To address non-IID heterogeneity, we devise (1) a globally shared LoRA adapter paired with a personalized, client-specific LLM head. Moreover, we introduce (2) a personalized DPO training strategy with a client-specific explicit reward head to complement implicit rewards and further alleviate non-IID heterogeneity, and (3) a bottleneck adapter to balance global and local features. We provide a theoretical analysis establishing the probabilistic foundation and soundness of the approach. Extensive experiments on multiple preference datasets demonstrate state-of-the-art performance, with average accuracy improvements of up to 4.80% in federated intra-domain and cross-domain settings.
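
For readers unfamiliar with the implicit rewards mentioned in the abstract, the following is a minimal sketch of the standard (non-federated) DPO loss for a single preference pair. It is not the personalized PDPO objective proposed in the paper, which additionally uses an explicit reward head. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; inputs are assumed to be summed sequence log-probabilities.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the log-probability of a full response under either
    the trainable policy or the frozen reference model.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's log-ratio relative to the rejected one.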

Paper Structure (69 sections, 26 equations, 10 figures, 10 tables, 1 algorithm)

Figures (10)

  • Figure 1: Accuracy comparison of DPO under IID and non-IID settings on the code vulnerability dataset.
  • Figure 2: Overview of FedPDPO. Each client fine-tunes and shares a LoRA adapter $(\boldsymbol{A},\boldsymbol{B})$ on a frozen backbone $\boldsymbol{W}_0$ while locally training personalized modules: a bottleneck adapter $\boldsymbol{M}$, local LLM and reward heads $(\boldsymbol{h}_1,\boldsymbol{h}_2)$ with an effective PDPO training strategy.
  • Figure 3: Test accuracy versus communication rounds in intra-domain FL settings with the IMDB dataset and 10 clients.
  • Figure 4: Test accuracy versus communication rounds in cross-domain FL settings, with datasets from 3 domains assigned to 3 clients.
  • Figure 5: An example preference triple from the IMDB dataset. The prompt is a movie review prefix, and the model must learn to prefer the chosen response (positive sentiment, blue) over the rejected response (negative sentiment, red).
  • ...and 5 more figures
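
The split described in the Figure 2 caption, with a shared LoRA adapter and locally kept personalized modules, can be sketched roughly as follows. This is an illustrative toy, not the paper's algorithm: uniform averaging, the flat-list parameter layout, and the names `average_shared`, `apply_round`, `lora_A`, `lora_B`, `bottleneck_M`, `llm_head`, and `reward_head` are all assumptions made for the example.

```python
def average_shared(client_states, shared_keys=("lora_A", "lora_B")):
    """Uniformly average only the shared entries across clients (assumed rule)."""
    n = len(client_states)
    averaged = {}
    for key in shared_keys:
        # Pair up the i-th parameter of every client and average it.
        columns = zip(*(state[key] for state in client_states))
        averaged[key] = [sum(col) / n for col in columns]
    return averaged

def apply_round(client_states, shared_keys=("lora_A", "lora_B")):
    """Return new client states: shared entries replaced, local ones untouched."""
    global_shared = average_shared(client_states, shared_keys)
    return [
        {**state, **{k: list(v) for k, v in global_shared.items()}}
        for state in client_states
    ]

# Toy parameters: each module is a short flat list of floats.
clients = [
    {"lora_A": [1.0, 2.0], "lora_B": [0.0, 4.0],
     "bottleneck_M": [0.5], "llm_head": [1.0], "reward_head": [3.0]},
    {"lora_A": [3.0, 6.0], "lora_B": [2.0, 0.0],
     "bottleneck_M": [0.9], "llm_head": [2.0], "reward_head": [7.0]},
]
updated = apply_round(clients)
```

After the round, every client holds the same `lora_A` and `lora_B`, while `bottleneck_M`, `llm_head`, and `reward_head` keep their client-specific values, so only the LoRA parameters would need to be communicated.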