Learning to summarize user information for personalized reinforcement learning from human feedback

Hyunji Nam; Yanming Wan; Mickel Liu; Peter Ahnn; Jianxun Lian; Natasha Jaques

TL;DR

Standard RLHF fits a single reward model to the entire user population, so it cannot represent differences between users. This work addresses that limitation with a user-conditioned reward model. It introduces PLUS, which jointly trains a summarizer that produces a text-based user summary $z$ from the user's context $c$ and a reward model $r_\phi(s|z)$ conditioned on that summary, in an online co-adaptive loop using PPO. PLUS achieves 11–77% gains in reward-model accuracy over the standard Bradley-Terry-Luce (BTL) reward model and is robust to topic shifts and new users. Its summaries also enable zero-shot personalization of proprietary models such as GPT-4o. Because the user representations are human-readable text, they support personalized responses and greater transparency in LLM alignment, with demonstrated benefits on real-world pluralistic datasets such as PRISM.

Abstract

As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align with different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at making LLMs generally more helpful and fluent, it does not account for variability across users: it models the entire user population with a single reward model, which assumes that everyone's preferences are the same. We present a novel framework, Preference Learning Using Summarization (PLUS), that uses reinforcement learning (RL) to learn to produce text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. The user-summarization model and the reward model are trained simultaneously, creating an online co-adaptation loop. We show that, in contrast to the standard Bradley-Terry model, summaries produced by PLUS capture diverse aspects of user preferences, achieving an 11–77% improvement in reward model accuracy. Key strengths of PLUS are: (1) robust performance with new users and conversation topics, achieving a 25% improvement over the best personalized reward model technique used for RLHF; (2) zero-shot personalization with state-of-the-art proprietary models like GPT-4 (e.g., PLUS-summary-conditioned responses achieved a 72% win rate compared to 28% for default GPT-4o); (3) learning from flexible user contexts beyond preference labels; and (4) interpretable representation of users, enabling greater transparency and user control in pluralistic LLM alignment.
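
To make the contrast with a single population-level reward model concrete, the following is a minimal sketch of a Bradley-Terry preference probability in which the reward depends on a text summary of the user. It is an illustration under stated assumptions, not the authors' implementation: `toy_reward` is a hypothetical hand-written stand-in for a learned, summary-conditioned reward model, and the summaries and responses are invented for the example.

```python
import math


def bt_preference_prob(r_first: float, r_second: float) -> float:
    """Bradley-Terry probability that the first response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_first - r_second)))


def toy_reward(summary: str, response: str) -> float:
    """Hypothetical stand-in for a summary-conditioned reward model.

    Assumption for illustration only: a user whose summary mentions
    'concise' values shorter responses; any other user values longer ones.
    """
    n_words = len(response.split())
    return -float(n_words) if "concise" in summary else float(n_words)


def bt_loss(summary: str, chosen: str, rejected: str) -> float:
    """Negative log-likelihood of the observed preference under Bradley-Terry."""
    p = bt_preference_prob(toy_reward(summary, chosen),
                           toy_reward(summary, rejected))
    return -math.log(p)


# Invented example responses (not taken from any dataset).
short = "Use a list."
long = "You could use a list, which is an ordered and mutable sequence type."

# The same response pair is ranked differently depending on the user summary.
p_concise_user = bt_preference_prob(
    toy_reward("prefers concise answers", short),
    toy_reward("prefers concise answers", long),
)
p_detailed_user = bt_preference_prob(
    toy_reward("prefers detailed answers", short),
    toy_reward("prefers detailed answers", long),
)
```

With a reward that ignores the summary, both users would receive the same preference probability for this pair; conditioning on the summary is what lets the two predictions differ.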
