Learning to summarize user information for personalized reinforcement learning from human feedback

Hyunji Nam, Yanming Wan, Mickel Liu, Peter Ahnn, Jianxun Lian, Natasha Jaques

TL;DR

This work tackles a key limitation of RLHF, its reliance on a single reward model for all users, by modeling diverse user preferences with a user-conditioned reward model. It introduces PLUS, which jointly learns a summarizer that produces a text-based user summary $z$ from context $c$ and a reward model $r_\phi(s|z)$ conditioned on that summary, in an online co-adaptive loop trained with PPO. PLUS achieves 11–77% gains in reward-model accuracy over the traditional Bradley-Terry-Luce (BTL) model and is robust to topic shifts and new users; its summaries also enable zero-shot personalization of proprietary models like GPT-4o. The approach yields human-readable, interpretable user representations that enable personalized responses and improved transparency in LLM alignment, with demonstrated benefits on real-world pluralistic datasets such as PRISM.

Abstract

As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align with different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users: it models the entire user population with a single reward model, which assumes that everyone's preferences are the same. We present a novel framework, Preference Learning Using Summarization (PLUS), that uses reinforcement learning (RL) to learn to produce text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. Both the user-summarization model and the reward model are trained simultaneously, creating an online co-adaptation loop. We show that, in contrast to the standard Bradley-Terry model, summaries produced by PLUS capture diverse aspects of user preferences, achieving an 11–77% improvement in reward model accuracy. Key strengths of PLUS are: (1) robust performance with new users and conversation topics, achieving a 25% improvement over the best personalized reward model technique used for RLHF; (2) zero-shot personalization with state-of-the-art proprietary models like GPT-4 (e.g., PLUS-summary-conditioned responses achieved a 72% win rate compared to 28% for default GPT-4o); (3) learning from flexible user contexts beyond preference labels; and (4) interpretable representation of users, enabling greater transparency and user control in pluralistic LLM alignment.
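
To make the co-adaptation idea concrete, the following is a minimal sketch of how a summary-conditioned Bradley-Terry loss can double as the summarizer's return. It is an illustration under stated assumptions, not the paper's implementation: `toy_reward` is a hypothetical word-overlap stand-in for the learned reward model $r_\phi(s|z)$, and all function names are invented for this example.

```python
import math


def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def bt_nll(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the observed preference (the reward model's loss)."""
    return -math.log(bt_preference_prob(r_chosen, r_rejected))


def toy_reward(summary: str, response: str) -> float:
    """Hypothetical stand-in for r_phi(s | z): counts response words found in the summary.

    A real reward model would be a learned network; this exists only so the
    example runs without any dependencies.
    """
    summary_words = set(summary.lower().split())
    return float(sum(1 for w in response.lower().split() if w in summary_words))


def summarizer_return(summary: str, chosen: str, rejected: str) -> float:
    """Return for the summarizer: the negative of the reward model's prediction loss."""
    return -bt_nll(toy_reward(summary, chosen), toy_reward(summary, rejected))


if __name__ == "__main__":
    chosen = "concise answers here"
    rejected = "a very long reply"
    # An informative summary lets the reward model separate the two responses.
    print(summarizer_return("prefers concise answers", chosen, rejected))
    # An empty summary gives equal rewards, so the return is -log(2).
    print(summarizer_return("", chosen, rejected))
```

In this toy setting, a summary that helps the reward model predict the user's preference earns a higher return than an uninformative one, which is the signal an RL algorithm such as PPO could then optimize.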

Paper Structure

This paper contains 33 sections, 6 equations, 4 figures, 14 tables, 3 algorithms.

Figures (4)

  • Figure 1: Caution: This content may reflect particular beliefs. While standard RLHF techniques fail to capture user variability, PLUS trains both a summarizer and reward model in an online co-adaptive framework to learn summaries $z$ useful for predicting diverse preferences. Italicized texts are actual outputs by GPT-4o and PLUS showing the effects of summaries on personalization.
  • Figure 2: Eq. \ref{eq:loss_minimization} naturally leads to the online co-adaptation of the summarizer and the reward model.
  • Figure 3: PPO training return curves for Pets, Ultrafeedback P2, and PRISM using a Qwen3B-Instruct summarizer and a Qwen0.5B reward model with a rollout batch size of 256. The return obtained by the summarizer is the negative of the reward model's prediction (log-likelihood) loss.
  • Figure 4: Win rates of personalized vs. default GPT-4 on the PRISM dataset (evaluated with two prompts for each of 308 unseen users).