Personalized Language Modeling from Personalized Human Feedback

Xinyu Li; Ruiyang Zhou; Zachary C. Lipton; Liu Leqi

TL;DR

The paper addresses a limitation of vanilla RLHF: it assumes that all human preferences come from a single distribution, which hinders personalization of LLM outputs. It introduces Personalized-RLHF (P-RLHF), which couples a lightweight, learnable user model with a base LLM and uses both explicit textual user information and implicit feedback to tailor responses through a Personalized Direct Preference Optimization (P-DPO) objective. The authors show that P-RLHF can learn individualized or cluster-based user embeddings, which lets personalization scale to large user bases and improves alignment with diverse preferences on synthetic, semi-synthetic, and real-world datasets. The approach reduces the need for multiple models or reward models, supports unseen users through a generic embedding, and is compatible with various RLHF variants. The code is open source.

Abstract

Personalized large language models (LLMs) are designed to tailor responses to individual user preferences. While Reinforcement Learning from Human Feedback (RLHF) is a commonly used framework for aligning LLMs with human preferences, vanilla RLHF assumes that all human preferences share the same distribution, preventing fine-tuned LLMs from generating personalized content when user preferences are diverse. In this work, we propose Personalized-RLHF (P-RLHF), an efficient framework that utilizes a lightweight user model to capture individual user preferences and jointly learns the user model and the personalized LLM from human feedback. P-RLHF exhibits the following three characteristics: (1) It enables an LLM to generate personalized content and scale efficiently with a growing number of users. (2) It handles both explicit user preferences described as textual input and implicit user preferences encoded in the feedback data. (3) It eliminates the need for users to fully articulate their preferences, which are normally needed for prompting LLMs to generate personalized content yet are often impractical to obtain in real-world scenarios. Our experimental results show that personalized LLMs trained using P-RLHF generate responses that are more closely aligned with individual user preferences, outperforming vanilla, non-personalized RLHF and prompting-based personalization approaches across different tasks. We open-source our code at https://github.com/HumainLab/Personalized_RLHF.

Paper Structure (34 sections, 4 theorems, 16 equations, 6 figures, 7 tables)

Introduction
Related Work
Vanilla RLHF
Motivation for personalized RLHF: Undesirable Assumption on User Preferences in Vanilla RLHF
Learning from Personalized Human Feedback
Personalized LLM: Problem setup
P-RLHF General Framework
P-RLHF User Models
P-RLHF Learning Objective: Personalized DPO
Experiments
Generation with Conflicting Preferences
Instruction Following under Different Preference Profiles
Personalization on Real-World Preference Dataset with Large User Base
Conclusions
Additional Related Work
...and 19 more sections

Key Result

Lemma 3.1

[$r_\text{vanilla}$ is equivalent to majority voting] For all $i \in [n]$, the estimated user preference under $r_\text{vanilla}$ is given by an expression over $\mathcal{C}_i$ (not reproduced here), where $\mathcal{C}_i =\{j \in [n] \mid x_j = x_i, y_{j,1} = y_{i,1}, y_{j,2} = y_{i,2} \} \cup \{j \in [n] \mid x_j = x_i, y_{j,1} = y_{i,2}, y_{j,2} = y_{i,1} \}$ is the set of sample indices that share the same prompt and the same response pair (in either order) as sample $i$.
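Read literally, the lemma's title suggests that the estimate for sample $i$ is the fraction of samples in $\mathcal{C}_i$ that order the two responses the same way as sample $i$. The sketch below computes that fraction; it is an illustration of this reading, not the paper's code. The function name `majority_vote_estimates` and the toy data are hypothetical, and the sketch assumes each sample is stored as (prompt, preferred response, rejected response).

```python
from collections import Counter

def majority_vote_estimates(samples):
    """For each sample i, return the fraction of samples in C_i that agree
    with sample i's ordering of the two responses.

    Assumption: each sample is (prompt, preferred, rejected), i.e. the first
    response in the tuple is the one the annotator preferred.
    """
    pair_counts = Counter()     # |C_i|: same prompt, same pair, either order
    ordered_counts = Counter()  # samples with exactly the same ordering
    for prompt, preferred, rejected in samples:
        pair_counts[(prompt, frozenset((preferred, rejected)))] += 1
        ordered_counts[(prompt, preferred, rejected)] += 1
    return [
        ordered_counts[(p, w, l)] / pair_counts[(p, frozenset((w, l)))]
        for p, w, l in samples
    ]

# Toy data (made up): three annotators label the same pair for "prompt A".
samples = [
    ("prompt A", "long answer", "short answer"),
    ("prompt A", "long answer", "short answer"),
    ("prompt A", "short answer", "long answer"),
    ("prompt B", "x", "y"),
]
estimates = majority_vote_estimates(samples)
```

In this toy data the third annotator is in the minority, so their own ordering gets an estimate of 1/3: a single pooled estimate follows the majority regardless of who is asking.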

Figures (6)

Figure 1: Our Personalized RLHF framework. A personalized LLM (highlighted in orange) consists of two key components: a learnable user model and a base LLM (introduced in Section 4.2). For training, the user information $u_i$ and the preference data are collected from each user (in this example there are $3$ users $i = 1, 2, 3$). The user model maps the user information into user embeddings (user-specific embeddings $e_i$ and the generic embedding $e_0$ that captures the common preferences shared across users), which are learned jointly with the base LLM using a new P-RLHF learning objective (derived in Section 4.4). During generation, for seen users, the responses tailored to their individual preferences are generated based on the learned user embeddings ($e_i$), while for new users unseen during training, responses are generated using the generic embedding ($e_0$).
Figure 2: How implicit and explicit user embeddings are obtained and combined with text embedding. Dashed boxes indicate optional components. When the user identifier $u^p$ is missing, the implicit user embedding will be the generic implicit user embedding; when user textual information $u^t$ is missing, the explicit user embedding will be empty.
Figure 3: The number of words (mean and standard error) in the responses that P-DPO with individualized preference generated for workers $1$ to $10$, compared to SFT (S), vanilla DPO (V), and P-DPO using the generic user embedding (G). P-DPO generated zero-length responses only for minority workers $4, 5, 6$, who always prefer shorter responses.
Figure 4: Accuracies (Acc) of vanilla DPO and P-DPO models. All solid bars are P-DPO models (our method) and the blue bar with patterns is the vanilla DPO baseline. (a) The accuracies of top $10$ workers. (b) The accuracies of P-DPO models in the ablation study on top $10$ workers, where Ind stands for Individual. (c) The accuracies of top $40$ workers.
Figure 5: (a) The Accuracy-top curves over training steps for the vanilla DPO and P-DPO models. (b) The Accuracy-generic curves over training steps for the vanilla DPO and P-DPO models.
...and 1 more figure
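The fallback behavior described in the captions of Figures 1 and 2 can be sketched as follows: a seen user gets a user-specific embedding $e_i$, a missing or unseen user identifier falls back to the generic embedding $e_0$, and missing user text yields an empty explicit embedding. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `UserModelSketch` and `build_input` are hypothetical, the embeddings are random placeholders rather than learned parameters, and prepending the user embeddings to the text embeddings is an assumption about how they are "combined".

```python
import random

class UserModelSketch:
    """Hypothetical lookup table: one embedding per seen user plus a generic one."""

    def __init__(self, user_ids, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # e_0: generic embedding used when the user is missing or unseen.
        self.generic = [rng.gauss(0.0, 0.02) for _ in range(dim)]
        # e_i: one placeholder embedding per user seen during training.
        self.specific = {
            uid: [rng.gauss(0.0, 0.02) for _ in range(dim)] for uid in user_ids
        }

    def implicit(self, user_id=None):
        """Return e_i for a seen user, otherwise the generic embedding e_0."""
        return self.specific.get(user_id, self.generic)

def build_input(model, text_embeddings, user_id=None, user_text_embeddings=None):
    """Assemble the embedding sequence for one prompt.

    Assumption: user embeddings are prepended to the text embeddings.
    The explicit part is empty when no user text is provided.
    """
    explicit = list(user_text_embeddings) if user_text_embeddings else []
    return [model.implicit(user_id)] + explicit + list(text_embeddings)

model = UserModelSketch(user_ids=["u1", "u2", "u3"], dim=4)
prompt = [[0.0] * 4 for _ in range(3)]      # three placeholder token embeddings
profile = [[1.0] * 4 for _ in range(2)]     # two placeholder user-text embeddings

seen = build_input(model, prompt, user_id="u1", user_text_embeddings=profile)
unseen = build_input(model, prompt, user_id="new-user")
```

Here `seen` starts with the embedding for `u1` followed by the two profile embeddings, while `unseen` starts with the generic embedding and has no explicit part.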

Theorems & Definitions (11)

Lemma 3.1
Lemma 3.1
Example 1: Uniform Preference
Example 2: Individualized Preference
Example 3: Cluster-based Preference
Remark 4.1
Lemma B.0
proof
Lemma B.0
proof
...and 1 more
