APPA: Adaptive Preference Pluralistic Alignment for Fair Federated RLHF of LLMs

Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki

Abstract

Aligning large language models (LLMs) with diverse human preferences requires pluralistic alignment, where a single model must respect the values of multiple distinct groups simultaneously. In federated reinforcement learning from human feedback (FedRLHF), these groups align a shared policy without centralizing preference data, which makes fair reward aggregation essential. Existing aggregation methods exhibit clear trade-offs: average-based aggregation systematically under-aligns worst-performing groups, while min aggregation prioritizes worst-group performance at the cost of overall alignment. We propose APPA, an Adaptive Preference Pluralistic Alignment framework that dynamically reweights group-level rewards based on historical alignment rewards. Our approach prioritizes under-aligned groups without degrading well-aligned ones, while requiring no access to raw preference data. Integrated into a proximal policy optimization (PPO)-based FedRLHF pipeline and evaluated on GLOBALQA and OQA across three model families (Gemma 2 2B, Llama 3.2 3B, Qwen3 0.6B), APPA achieves strong fairness-alignment trade-offs, improving worst-group alignment by up to 28% over average aggregation while maintaining higher overall alignment than min aggregation across most configurations.
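To make the contrast between the three aggregation rules concrete, the sketch below compares average aggregation, min aggregation, and an adaptive reweighting in the spirit of APPA. The `adaptive` branch, the `temperature` parameter, and the softmax-over-negative-history weighting are illustrative assumptions, not the paper's exact update rule.

```python
import numpy as np

def aggregate_rewards(group_rewards, history, strategy="adaptive", temperature=1.0):
    """Combine per-group alignment rewards into one scalar for the PPO update.

    group_rewards : current reward from each of the G federated groups.
    history       : running mean alignment reward per group (length G).
    The "adaptive" branch is only an illustrative rule: groups whose
    historical alignment is low get exponentially larger weight, which is
    the general idea behind APPA but not its exact formula.
    """
    r = np.asarray(group_rewards, dtype=float)
    h = np.asarray(history, dtype=float)
    if strategy == "average":
        return float(r.mean())        # uniform weights: worst group can stay under-aligned
    if strategy == "min":
        return float(r.min())         # worst-group only: overall alignment suffers
    if strategy == "adaptive":
        w = np.exp(-h / temperature)  # more weight to historically under-aligned groups
        w /= w.sum()
        return float(w @ r)
    raise ValueError(f"unknown strategy: {strategy!r}")

# Example: group 0 has been poorly aligned so far, so it dominates the adaptive weights.
print(aggregate_rewards([0.2, 0.8, 0.9], history=[0.1, 0.7, 0.8], strategy="adaptive"))
```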

Paper Structure

This paper contains 67 sections, 24 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of APPA: FedRLHF for pluralistic alignment of group preferences. At each PPO iteration, the server distributes rollouts to federated groups; each group scores responses using its local PluralLLM module and returns group-specific rewards. The server aggregates these rewards via the APPA adaptive aggregation before updating the policy (a rough sketch of this round structure follows this list).
  • Figure 2: Per-group alignment score comparison across selected demographic groups (Base vs. SFT vs. PPO-Min vs. PPO-Average vs. PPO-APPA). Top row: GLOBALQA, eight countries with diverse opinions (JS metric). Bottom row: OQA, eight US demographic groups with diverse opinions (Wasserstein metric). A wider, more uniformly filled polygon indicates higher and more equitable alignment across groups.
  • Figure 3: Fairness–Alignment trade-off: QA Fairness Index vs. Minimum Alignment Score (GLOBALQA). Left: DPA task using JS metric. Right: OPA task using Borda metric. Each point is a (model, aggregation strategy) pair. PPO-APPA (red stars) generally occupies the upper-right region across the evaluated model families. Gray lines connect strategy points within each model family, illustrating the progression from Base through SFT to PPO strategies.
  • Figure 4: Distributional Preference Alignment (DPA) prompt template. The model outputs a calibrated probability distribution over all answer options. The number of options K varies dynamically per question.
  • Figure 5: Ordinal Preference Alignment (OPA) prompt template. The model outputs a ranked ordering of all answer options from most to least preferred. The number of options K varies dynamically per question.
  • ...and 1 more figure
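The pipeline in Figure 1 can be summarized as a single round of server rollouts, local group scoring, adaptive aggregation, and a PPO update. The sketch below is a hypothetical illustration of that round: `policy.generate`, `group.score`, `ppo_step`, and the exponential-moving-average history update are assumed interfaces, and the aggregation reuses the `aggregate_rewards` helper sketched after the abstract. Only scalar rewards cross the group boundary; raw preference data stays local.

```python
import numpy as np

def fedrlhf_round(policy, groups, prompts, history, ppo_step, ema=0.9):
    """One illustrative FedRLHF round following the Figure 1 pipeline.

    The server samples rollouts from the shared policy, each federated group
    scores them with its local reward module (e.g., a PluralLLM-style scorer)
    and returns only scalar rewards, and the server aggregates those rewards
    adaptively before a single PPO update.
    """
    responses = [policy.generate(p) for p in prompts]                     # server rollouts
    per_group = np.array([g.score(prompts, responses) for g in groups])   # shape (G, N)
    # Aggregate the G group rewards for each prompt with the adaptive rule.
    rewards = [
        aggregate_rewards(per_group[:, i], history, strategy="adaptive")
        for i in range(len(prompts))
    ]
    # Track each group's running alignment to drive the next round's weights.
    history = ema * np.asarray(history) + (1.0 - ema) * per_group.mean(axis=1)
    ppo_step(policy, prompts, responses, rewards)                         # shared-policy update
    return history
```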