Language Model Personalization via Reward Factorization

Idan Shenfeld; Felix Faltings; Pulkit Agrawal; Aldo Pacchiano

TL;DR

This work tackles personalization in RLHF by introducing Personalization via Reward Factorization (PReF), which represents each user’s reward as a linear combination $r_i(x,y)=\lambda_i^\top\phi(x,y)$ of a small set of base reward functions $\phi$, weighted by user-specific coefficients $\lambda_i$. The base functions are learned from user-annotated pairwise preferences with a regularized maximum-likelihood objective, using SVD initialization to stabilize the bilinear factorization. For a new user, PReF infers the weights $\lambda_i$ through an active-learning loop that selects informative response pairs using uncertainty estimates derived from a Hessian-based ellipsoid, which enables inference-time alignment without retraining the LLM. In experiments with synthetic and real users, PReF yields substantial personalization gains, achieving a 67% win rate over default GPT-4o responses in human evaluations, and it is data efficient, requiring roughly an order of magnitude fewer user responses than per-user training. The framework thus offers a scalable path to tailored LLM behavior with minimal user effort and no frequent model updates.
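The weight-inference step described above can be illustrated with a small sketch. It is not the authors' implementation and makes several assumptions: preferences are assumed to follow a Bradley-Terry (logistic) model on the reward difference, the base reward features $\phi$ are replaced by random vectors, and the query rule (pick the pair whose feature difference $d$ maximizes $d^\top H^{-1} d$) is a common uncertainty criterion standing in for the Hessian-based ellipsoid mentioned in the summary. All function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def neg_log_lik(lam, D, y, reg):
    # Regularized negative log-likelihood under an ASSUMED Bradley-Terry model:
    # P(first response preferred) = sigmoid(lam . d), d = phi(x, y_a) - phi(x, y_b).
    z = D @ lam
    return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * reg * lam @ lam


def hessian(lam, D, reg):
    # Hessian of the objective above; it is positive definite because reg > 0.
    p = sigmoid(D @ lam)
    return D.T @ (D * (p * (1.0 - p))[:, None]) + reg * np.eye(D.shape[1])


def fit_user_weights(D, y, reg=1.0, iters=25):
    # Newton's method with step halving for the regularized MLE of lambda.
    lam = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (sigmoid(D @ lam) - y) + reg * lam
        step = np.linalg.solve(hessian(lam, D, reg), grad)
        t, current = 1.0, neg_log_lik(lam, D, y, reg)
        while neg_log_lik(lam - t * step, D, y, reg) > current and t > 1e-8:
            t *= 0.5
        lam = lam - t * step
    return lam


def select_pair(candidates, H):
    # Choose the candidate feature difference with the largest d^T H^{-1} d,
    # i.e. the direction in which the current estimate is least certain.
    H_inv = np.linalg.inv(H)
    scores = np.einsum("ij,jk,ik->i", candidates, H_inv, candidates)
    return int(np.argmax(scores))


def simulate_new_user(true_lam, pool, n_queries=10, reg=1.0, seed=0):
    # Active-learning loop against a simulated user with hidden weights true_lam.
    rng = np.random.default_rng(seed)
    k = pool.shape[1]
    D, y, lam = np.empty((0, k)), np.empty(0), np.zeros(k)
    for _ in range(n_queries):
        idx = select_pair(pool, hessian(lam, D, reg))
        d = pool[idx]
        answer = float(rng.random() < sigmoid(d @ true_lam))  # simulated reply
        D, y = np.vstack([D, d]), np.append(y, answer)
        lam = fit_user_weights(D, y, reg)
    return lam


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    pool = rng.normal(size=(500, 3))  # stand-in feature differences, not real phi
    print(simulate_new_user(np.array([1.5, -1.0, 0.5]), pool))
```

In this sketch the LLM itself is never updated: only the low-dimensional vector $\lambda_i$ is re-estimated after each answer, which is what makes a budget of roughly ten queries plausible when the number of base reward functions is small.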

Abstract

Modern large language models (LLMs) are optimized for human-aligned responses using Reinforcement Learning from Human Feedback (RLHF). However, existing RLHF approaches assume a universal preference model and fail to account for individual user preferences, limiting their effectiveness in personalized applications. We introduce a framework that extends RLHF to enable user personalization by leveraging the assumption that user preferences lie in a low-dimensional space. Instead of training a separate model per user, we represent user-specific rewards as a linear combination of base reward functions. Using only about 10 user responses, our method can infer user-specific rewards and align LLM outputs accordingly. We validate our approach through experiments with both synthetic and real users and find that it achieves significant personalization. In human evaluations, our method achieves a 67% win rate over default GPT-4o responses.
