Reinforcement Learning from User Feedback

Eric Han; Jun Chen; Karthik Abinav Sankararaman; Xiaoliang Peng; Tengyu Xu; Eryk Helenowski; Kaiyan Peng; Mrinal Kumar; Sinong Wang; Han Fang; Arya Talebzadeh

TL;DR

RLUF shifts LLM alignment from expert-derived labels to real user signals collected in production, using a lightweight binary proxy ($P[\text{Love}]$) derived from Love Reactions. A three-model reward suite (Love, Helpfulness, Safety) feeds a Mixture of Judges-based multi-objective optimization to balance user delight with core alignment goals. Offline evaluations show that $P[\text{Love}]$ strongly predicts online outcomes (Pearson $r=0.95$), and production A/B tests yield up to a $28\%$ increase in Love Reactions, though reward hacking emerges as a notable risk. The work demonstrates a scalable, data-driven path to user-aligned LLMs, while highlighting the need for stronger anti-hacking controls and richer signals to deploy production models safely.
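To make the offline-to-online claim concrete, the following is a minimal sketch of how a Pearson correlation between offline $P[\text{Love}]$ scores and online Love Reaction rates could be computed. The function name `pearson_r` and the sample values are assumptions made for illustration; the numbers are invented and are not results from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-policy values (made up for illustration, NOT from the paper):
# mean offline P[Love] score of each candidate policy, and the Love Reaction
# rate the same policy might obtain in an online test.
offline_p_love = [0.10, 0.12, 0.15, 0.18, 0.22]
online_love_rate = [0.010, 0.013, 0.014, 0.019, 0.021]

r = pearson_r(offline_p_love, online_love_rate)
print(f"Pearson r = {r:.3f}")
```

A value of $r$ close to 1 across candidate policies is what would justify using the offline score as a stand-in for an online test.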

Abstract

As large language models (LLMs) are increasingly deployed in diverse user-facing applications, aligning them with real user preferences becomes essential. Existing methods like Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of everyday users. We introduce Reinforcement Learning from User Feedback (RLUF), a framework for aligning LLMs directly to implicit signals from users in production. RLUF addresses key challenges of user feedback, which is often binary (e.g., emoji reactions), sparse, and occasionally adversarial. We train a reward model, P[Love], to predict the likelihood that an LLM response will receive a Love Reaction, a lightweight form of positive user feedback, and integrate P[Love] into a multi-objective policy optimization framework alongside helpfulness and safety objectives. In large-scale experiments, we show that P[Love] is predictive of increased positive feedback and serves as a reliable offline evaluator of future user behavior. Policy optimization using P[Love] significantly raises observed positive-feedback rates, including a 28% increase in Love Reactions during live A/B tests. However, optimizing for positive reactions introduces reward hacking challenges, requiring careful balancing of objectives. By directly leveraging implicit signals from users, RLUF offers a path to aligning LLMs with real-world user preferences at scale.
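The idea of a reward model that predicts a binary reaction can be illustrated with a toy classifier. The sketch below fits a logistic regression with a binary cross-entropy objective on hand-made features. It is a deliberately simplified stand-in, not the paper's reward model, and the names `train_p_love` and `p_love`, the single "response quality" feature, and the data are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_p_love(features, labels, lr=0.5, epochs=500):
    """Fit a logistic regression by full-batch gradient descent on BCE loss.

    features: list of feature vectors (hypothetical response features)
    labels:   1 if the response received a Love Reaction, else 0
    """
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # derivative of the BCE loss with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def p_love(weights, bias, x):
    """Predicted probability that a response with features x gets a Love Reaction."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Toy data, invented for illustration: one feature per response.
toy_features = [[0.0], [0.2], [0.8], [1.0]]
toy_labels = [0, 0, 1, 1]

weights, bias = train_p_love(toy_features, toy_labels)
print(p_love(weights, bias, [1.0]), p_love(weights, bias, [0.0]))
```

In a policy-optimization setting, the predicted probability would serve as a scalar reward for a sampled response; how it is balanced against helpfulness and safety objectives is a separate design question that this sketch does not address.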
