
When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning

Yijiang River Dong, Tiancheng Hu, Yinhong Liu, Ahmet Üstün, Nigel Collier

TL;DR

This work addresses a limitation of RLHF, namely its assumption that all users share the same preferences, by proposing a multi-faceted evaluation framework for personalized preference learning in open-domain LLMs. It analyzes eight personalization methods across three general-domain datasets and characterizes those datasets by properties such as inter-personal disagreement, intra-personal consistency, minority viewpoints, and room for personalization. Key findings show that personalized reward modeling can outperform baselines, yet personalization can cause up to 20% safety misalignment and even degrade core reasoning in some cases, a cost the authors describe as a personalization tax. The study also highlights the efficacy of meta-learning approaches such as GPO for adapting to new users, and cautions that holistic evaluation is essential for developing inclusive, safe, and robust personalized systems for diverse global users.
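Inter-personal disagreement is one of the dataset properties the framework examines. This summary does not give its formula, so the Python sketch below assumes one simple definition: one minus the mean pairwise agreement between users on the comparisons they both labeled. The `labels[user][comparison_id]` layout and all function names are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from typing import Dict, Optional

# Hypothetical layout: labels[user][comparison_id] -> 0 or 1 (which response the user preferred).
Labels = Dict[str, Dict[str, int]]


def pairwise_agreement(a: Dict[str, int], b: Dict[str, int]) -> Optional[float]:
    """Fraction of shared comparisons on which two users prefer the same response."""
    shared = a.keys() & b.keys()
    if not shared:
        return None  # the two users never labeled the same comparison
    return sum(a[c] == b[c] for c in shared) / len(shared)


def inter_personal_disagreement(labels: Labels) -> float:
    """One minus the mean agreement over all user pairs with overlapping labels (assumed definition)."""
    scores = []
    for u, v in combinations(sorted(labels), 2):
        s = pairwise_agreement(labels[u], labels[v])
        if s is not None:
            scores.append(s)
    if not scores:
        raise ValueError("no pair of users labeled a common comparison")
    return 1.0 - sum(scores) / len(scores)


labels = {
    "user_a": {"q1": 0, "q2": 1, "q3": 1},
    "user_b": {"q1": 0, "q2": 0, "q3": 1},
    "user_c": {"q1": 1, "q2": 0, "q3": 0},
}
print(round(inter_personal_disagreement(labels), 3))  # 0.667
```

A value near 0 means users almost always agree, so a single shared reward model loses little; values near or above 0.5 on binary labels suggest more room for personalization.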

Abstract

While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority viewpoints. Although personalized preference learning addresses this by modeling each user's preferences individually, the field lacks standardized methods to assess its effectiveness. We present a multi-faceted evaluation framework that measures not only performance but also fairness, unintended effects, and adaptability across varying levels of preference divergence. Through extensive experiments comparing eight personalization methods across three preference datasets, we demonstrate that performance differences between methods can reach 36% when users strongly disagree, and personalization can introduce up to 20% safety misalignment. These findings highlight the critical need for holistic evaluation approaches to advance the development of more effective and inclusive preference learning systems.
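Performance in this framework is reported as reward model accuracy (Figure 2), and fairness concerns how well individual users, including minority ones, are served. A minimal sketch, assuming that accuracy means the share of held-out pairs on which the reward model scores the chosen response strictly above the rejected one, and using worst-user accuracy and the best-to-worst gap as illustrative fairness indicators. The `RewardFn` signature, the toy reward, and the metric names are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: reward(user_id, prompt, response) -> scalar score.
RewardFn = Callable[[str, str, str], float]
Example = Tuple[str, str, str]  # (prompt, chosen_response, rejected_response)


def pairwise_accuracy(reward: RewardFn, user: str, examples: List[Example]) -> float:
    """Share of pairs where the chosen response scores strictly higher than the rejected one."""
    hits = sum(reward(user, p, c) > reward(user, p, r) for p, c, r in examples)
    return hits / len(examples)


def evaluate(reward: RewardFn, test_sets: Dict[str, List[Example]]) -> Dict[str, float]:
    """Mean accuracy plus two simple fairness indicators across users."""
    per_user = [pairwise_accuracy(reward, u, ex) for u, ex in test_sets.items()]
    return {
        "mean_acc": mean(per_user),
        "worst_user_acc": min(per_user),  # how the least-served user fares
        "gap": max(per_user) - min(per_user),  # spread between best- and worst-served users
    }


# A deliberately non-personalized toy reward: longer responses score higher for everyone.
def length_reward(user: str, prompt: str, response: str) -> float:
    return float(len(response))


test_sets = {
    "u1": [("p1", "a long answer", "short"), ("p2", "detailed reply", "ok")],
    "u2": [("p1", "short", "a long answer"), ("p2", "detailed reply", "ok")],
}
print(evaluate(length_reward, test_sets))
# {'mean_acc': 0.75, 'worst_user_acc': 0.5, 'gap': 0.5}
```

Because the toy reward ignores `user`, it cannot serve u2, who prefers the shorter response on p1. The mean of 0.75 hides this, while the worst-user accuracy and the gap surface it.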

Paper Structure

This paper contains 34 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Each user has a unique preference distribution in the response space. Traditional preference learning systems treat preference data as homogeneous, but preferences pooled across users are inherently self-conflicting, which makes them difficult and unstable to learn. A personalized preference learning system, however, can effectively capture and model the individual preference distribution of each user. The scatter plot uses PCA to visualize the embeddings of preferred responses for three selected users from Personal-LLM zollo_personalllm_2024.
  • Figure 2: Averaged Reward Model Accuracy Comparison Across Three Personalization Datasets. Panels (a), (b), and (c) show averaged accuracy across the three datasets with varying numbers of training samples. Panel (d) compares the accuracy of the personalization algorithms across the three datasets and across different models.
  • Figure 3: Adaptation to New Users on Personal-LLM: The figure presents the performance of different baselines in adapting to new users with varying amounts of training data. The dashed black line represents the accuracy of the Individual RM trained on the full dataset, serving as the theoretical upper bound.
  • Figure 4: Testing the Personalization Tax on Reward Bench. We measure each personalization method's accuracy and Reward Bench performance and report the change relative to the pre-trained RM dong2023raft.
  • Figure 5: Per-user Accuracy on Personal-LLM. User 8 is considered the minority because, as computed in the dataset table, majority voting reaches only 0.33 accuracy on that user's preferences.
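Figure 5 identifies the minority user by how often the majority-vote label matches that user's own preferences. A sketch of that check, assuming binary pairwise labels, a majority computed over all users (including the user being scored), ties dropped, and an illustrative cutoff of 0.5. The threshold, data layout, and helper names are assumptions rather than the paper's procedure, and the toy labels are not Personal-LLM data.

```python
from collections import Counter
from typing import Dict, List

Labels = Dict[str, Dict[str, int]]  # hypothetical: labels[user][comparison_id] -> 0 or 1


def majority_labels(labels: Labels) -> Dict[str, int]:
    """Majority-vote label per comparison across all users; ties are dropped."""
    votes: Dict[str, Counter] = {}
    for user_labels in labels.values():
        for cid, y in user_labels.items():
            votes.setdefault(cid, Counter())[y] += 1
    majority: Dict[str, int] = {}
    for cid, counts in votes.items():
        (top, n_top), *rest = counts.most_common()
        if not rest or rest[0][1] < n_top:  # skip comparisons where the vote is tied
            majority[cid] = top
    return majority


def majority_vote_accuracy(labels: Labels) -> Dict[str, float]:
    """For each user, how often the majority label matches their own label."""
    maj = majority_labels(labels)
    acc: Dict[str, float] = {}
    for user, user_labels in labels.items():
        shared = [cid for cid in user_labels if cid in maj]
        if not shared:
            acc[user] = float("nan")  # nothing to compare against
            continue
        acc[user] = sum(user_labels[c] == maj[c] for c in shared) / len(shared)
    return acc


def flag_minority(labels: Labels, threshold: float = 0.5) -> List[str]:
    """Users whose preferences the majority vote matches less often than `threshold`."""
    return sorted(u for u, a in majority_vote_accuracy(labels).items() if a < threshold)


labels = {
    "u1": {"q1": 0, "q2": 0, "q3": 1},
    "u2": {"q1": 0, "q2": 0, "q3": 1},
    "u3": {"q1": 1, "q2": 1, "q3": 1},
}
print(majority_vote_accuracy(labels))  # u3 matches the majority on 1 of 3 comparisons
print(flag_minority(labels))  # ['u3']
```

A non-personalized model trained on pooled labels effectively learns something close to the majority vote, so a user with low majority-vote accuracy is the one such a model is most likely to misserve.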
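Figure 4 reports each personalization method's change in accuracy relative to the pre-trained RM. A sketch of that bookkeeping, assuming per-category accuracies in [0, 1] and a plain difference (personalized minus pre-trained). The category names follow RewardBench's Chat, Safety, and Reasoning sections, but the numbers in the example are made up for illustration and are not results from the paper.

```python
from typing import Dict, Tuple

Scores = Dict[str, float]  # category -> accuracy in [0, 1]


def personalization_tax(base: Scores, personalized: Scores) -> Scores:
    """Per-category accuracy change relative to the pre-trained RM.

    Negative values mean personalization made that category worse.
    """
    mismatch = base.keys() ^ personalized.keys()
    if mismatch:
        raise ValueError(f"category mismatch: {sorted(mismatch)}")
    return {k: personalized[k] - base[k] for k in base}


def largest_drop(delta: Scores) -> Tuple[str, float]:
    """Category with the most negative change."""
    worst = min(delta, key=delta.get)
    return worst, delta[worst]


# Illustrative numbers only, not results from the paper.
base = {"Chat": 0.90, "Safety": 0.80, "Reasoning": 0.70}
personalized = {"Chat": 0.92, "Safety": 0.64, "Reasoning": 0.68}
delta = personalization_tax(base, personalized)
print({k: round(v, 2) for k, v in delta.items()})
name, drop = largest_drop(delta)
print(name, round(drop, 2))  # Safety -0.16
```

Reporting the full per-category delta, rather than only the change in personalized accuracy, is what makes a gain on one axis and a loss on another (for example safety) visible side by side.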