Language Model Personalization via Reward Factorization

Idan Shenfeld, Felix Faltings, Pulkit Agrawal, Aldo Pacchiano

TL;DR

This work tackles personalization in RLHF by introducing Personalization via Reward Factorization (PReF), which represents each user’s reward as $r_i(x,y)=\lambda_i^\top\phi(x,y)$, where $\phi(x,y)$ stacks a small set of base reward functions and $\lambda_i$ holds the user-specific weights. The base functions are learned from user-annotated pairwise preferences with a regularized maximum-likelihood objective, using SVD initialization to stabilize the bilinear factorization. For a new user, PReF infers the weights $\lambda_i$ through an active-learning loop that selects informative response pairs using uncertainty estimates derived from a Hessian-based ellipsoid, which enables inference-time alignment without retraining the LLM. Across synthetic and real-human experiments, PReF yields substantial personalization gains, achieving a 67% win rate over default GPT-4o responses in human evaluations, and it is data-efficient, requiring roughly an order of magnitude fewer user responses than per-user training. The framework thus offers a scalable path to tailored LLM behavior with minimal user effort and no frequent model updates.
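To make the factorization concrete, here is a minimal sketch of how a user's reward and a pairwise preference probability could be computed from it. It assumes the Bradley-Terry (logistic) preference model that is standard in RLHF; the function names and the two-dimensional feature vectors are hypothetical and not taken from the paper.

```python
import math

def user_reward(lam, phi):
    """Factorized reward r_i(x, y) = lam_i^T phi(x, y)."""
    return sum(l * p for l, p in zip(lam, phi))

def preference_prob(lam, phi_a, phi_b):
    """Assumed Bradley-Terry probability that the user prefers response a over b."""
    diff = user_reward(lam, phi_a) - user_reward(lam, phi_b)
    return 1.0 / (1.0 + math.exp(-diff))

# Hypothetical user who only cares about the first base reward.
lam = [1.0, 0.0]
p = preference_prob(lam, phi_a=[1.0, 0.0], phi_b=[0.0, 0.0])
```

Because the reward is linear in $\lambda_i$, the preference probability depends on the two responses only through the feature difference $\phi(x,y^a)-\phi(x,y^b)$.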

Abstract

Modern large language models (LLMs) are optimized for human-aligned responses using Reinforcement Learning from Human Feedback (RLHF). However, existing RLHF approaches assume a universal preference model and fail to account for individual user preferences, limiting their effectiveness in personalized applications. We introduce a framework that extends RLHF to enable user personalization by leveraging the assumption that user preferences lie in a low-dimensional space. Instead of training a separate model per user, we represent user-specific rewards as a linear combination of base reward functions. Using only ~10 user responses, our method can infer user-specific rewards and align LLM outputs accordingly. We validate our approach through experiments with both synthetic and real users, demonstrating that it achieves significant personalization. In human evaluations, our method achieves a 67% win rate over default GPT-4o responses.
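As an illustration of why a handful of answers can suffice once the base reward functions are fixed, the sketch below fits a user's weight vector with plain L2-regularized logistic regression on feature differences. This is an assumed stand-in rather than the paper's implementation: the ten feature differences, the labels, and the hyperparameters are all made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def infer_user_weights(diffs, labels, reg=0.1, lr=0.1, steps=2000):
    """Fit lambda by gradient ascent on the L2-regularized log-likelihood.

    diffs[k]  = phi(x_k, y_k^a) - phi(x_k, y_k^b)  (hypothetical features)
    labels[k] = 1 if the user preferred response a, else 0
    """
    dim = len(diffs[0])
    lam = [0.0] * dim
    for _ in range(steps):
        # Gradient of the regularizer, then of each answer's log-likelihood.
        grad = [-reg * l for l in lam]
        for d, y in zip(diffs, labels):
            p = sigmoid(sum(l * v for l, v in zip(lam, d)))
            for j in range(dim):
                grad[j] += (y - p) * d[j]
        lam = [l + lr * g for l, g in zip(lam, grad)]
    return lam

# Ten made-up answers from a user who likes feature 0 and dislikes feature 1.
diffs = [(1, 0), (0, 1), (1, 1), (1, -1), (-1, 0),
         (0, -1), (-1, 1), (2, 1), (-1, -1), (-2, -1)]
labels = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
lam_hat = infer_user_weights(diffs, labels)
```

Only the low-dimensional vector is estimated here, which is what keeps the number of required answers small; the base reward functions and the LLM stay untouched.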

Paper Structure

This paper contains 41 sections, 3 theorems, 32 equations, 11 figures, 3 tables, 2 algorithms.

Key Result

Lemma 4.1

(faury2020improved, Lemma 11) Let $\mathcal{E}_t(\delta) = \{ \lambda \in \mathbb{R}^d \ |\ \| \lambda - \lambda_t \|_{H_t(\lambda)} \leq \gamma_t(\delta)\}$ where $\gamma_t(\delta) = \mathcal{O}\left(d \log\left(\frac{t}{\delta}\right)\right)$, and assume $\|\phi\|\leq1$. The following holds with

Figures (11)

  • Figure 1: We factorize each user's personal reward as a linear combination of base functions. The linear structure enables efficient personalization, requiring up to 30× fewer answers from the user to reach the same performance as the standard RLHF approach.
  • Figure 2: ROC AUC and win rates for a varying number of user answers on the Attributes (left) and PRISM (right) datasets. Our method quickly achieves high ROC AUC and win rates, outperforming baselines by a large margin.
  • Figure 3: (A) Effect of L2 regularization and SVD initialization on model performance. Both choices are crucial for reducing training instabilities. (B) Increasing the feature dimension $J$ leads to better performance. (C) PReF's uncertainty-based selection of response pairs for eliciting user preferences outperforms the naive strategy of random selection.
  • Figure 4: Effect of scaling the dataset size (x-axis) and the size of the base reward function's neural network (different colors) on reward model performance on the PRISM dataset.
  • Figure 5: Sorted principal components of the Attributes dataset along with LLM-generated descriptions. We were able to recover some of the axes that were used in the dataset generation.
  • ...and 6 more figures
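The uncertainty-based selection mentioned in Figure 3(C) and the ellipsoid in Lemma 4.1 can be illustrated with a small sketch. It assumes a common heuristic for ellipsoidal confidence sets: score each candidate pair by $d^\top H^{-1} d$, where $d$ is the pair's feature difference and $H$ is a regularized logistic Hessian, then query the highest-scoring pair. The exact selection rule in the paper may differ; all names are hypothetical, and the code is restricted to two features so that the matrix inverse can be written in closed form (a real implementation would use a linear-algebra library).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hessian(lam, past_diffs, reg=1.0):
    """H = reg * I + sum_s sigmoid'(lam . d_s) d_s d_s^T for 2-D differences."""
    h = [[reg, 0.0], [0.0, reg]]
    for d in past_diffs:
        p = sigmoid(lam[0] * d[0] + lam[1] * d[1])
        w = p * (1.0 - p)  # derivative of the sigmoid at lam . d
        for i in range(2):
            for j in range(2):
                h[i][j] += w * d[i] * d[j]
    return h

def uncertainty(h, d):
    """Squared norm d^T H^{-1} d, using the closed-form 2x2 inverse."""
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    inv = [[h[1][1] / det, -h[0][1] / det],
           [-h[1][0] / det, h[0][0] / det]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))

def select_pair(lam, past_diffs, candidates, reg=1.0):
    """Return the index of the candidate pair with the largest uncertainty."""
    h = hessian(lam, past_diffs, reg)
    return max(range(len(candidates)), key=lambda k: uncertainty(h, candidates[k]))

# Four past queries all probed feature 0, so feature 1 is still uncertain.
past = [(1.0, 0.0)] * 4
candidates = [(1.0, 0.0), (0.0, 1.0)]
choice = select_pair([0.0, 0.0], past, candidates)
```

In this toy setup the second candidate is chosen, since earlier answers already constrain the weight on feature 0 but say nothing about feature 1.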

Theorems & Definitions (3)

  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.3