Reinforcement Learning from User Feedback

Eric Han, Jun Chen, Karthik Abinav Sankararaman, Xiaoliang Peng, Tengyu Xu, Eryk Helenowski, Kaiyan Peng, Mrinal Kumar, Sinong Wang, Han Fang, Arya Talebzadeh

TL;DR

RLUF shifts LLM alignment from expert-derived labels to real user signals collected in production, using a lightweight binary proxy ($P[\text{Love}]$) derived from Love Reactions. A three-model reward suite (Love, Helpfulness, Safety) feeds a multi-objective optimization based on Mixture of Judges, balancing user delight against core alignment goals. Offline evaluations show that $P[\text{Love}]$ strongly predicts online outcomes (Pearson $r=0.95$), and production A/B tests yield up to a $28\%$ increase in Love Reactions, though reward hacking emerges as a notable risk. The work demonstrates a scalable, data-driven path to user-aligned LLMs, while highlighting the need for stronger anti-hacking controls and richer signals to safely deploy production models.

Abstract

As large language models (LLMs) are increasingly deployed in diverse user-facing applications, aligning them with real user preferences becomes essential. Existing methods like Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of everyday users. We introduce Reinforcement Learning from User Feedback (RLUF), a framework for aligning LLMs directly to implicit signals from users in production. RLUF addresses key challenges of user feedback: such feedback is often binary (e.g., emoji reactions), sparse, and occasionally adversarial. We train a reward model, P[Love], to predict the likelihood that an LLM response will receive a Love Reaction, a lightweight form of positive user feedback, and integrate P[Love] into a multi-objective policy optimization framework alongside helpfulness and safety objectives. In large-scale experiments, we show that P[Love] is predictive of increased positive feedback and serves as a reliable offline evaluator of future user behavior. Policy optimization using P[Love] significantly raises observed positive-feedback rates, including a 28% increase in Love Reactions during live A/B tests. However, optimizing for positive reactions introduces reward hacking challenges, requiring careful balancing of objectives. By directly leveraging implicit signals from users, RLUF offers a path to aligning LLMs with real-world user preferences at scale.
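To make the idea of balancing a Love objective against helpfulness and safety concrete, here is a minimal sketch of one way three per-response reward scores could be folded into a single training signal. It is a deliberate simplification and not the paper's Mixture of Judges procedure: the `RewardScores` type, the `combined_reward` function, the `love_weight` blend, and the `safety_floor` gate are all hypothetical names and values introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RewardScores:
    """Per-response scores in [0, 1] from three hypothetical reward models."""
    love: float         # stand-in for the P[Love] reward model output
    helpfulness: float  # stand-in for a helpfulness reward model output
    safety: float       # stand-in for a safety reward model output


def combined_reward(scores: RewardScores,
                    love_weight: float = 0.3,
                    safety_floor: float = 0.5) -> float:
    """Blend Love and helpfulness; zero the reward when safety is below a floor.

    love_weight and safety_floor are illustrative assumptions, not values
    reported in the paper.
    """
    if scores.safety < safety_floor:
        # Treat safety as a hard constraint rather than a term to trade off.
        return 0.0
    return love_weight * scores.love + (1.0 - love_weight) * scores.helpfulness


response = RewardScores(love=0.8, helpfulness=0.6, safety=0.9)
moderate = combined_reward(response, love_weight=0.3)
aggressive = combined_reward(response, love_weight=0.7)
unsafe = combined_reward(RewardScores(love=0.9, helpfulness=0.9, safety=0.2))
```

Raising `love_weight` gives the Love score more influence at the expense of helpfulness, which mirrors the trade-off the abstract describes as requiring careful balancing of objectives. Gating on safety instead of averaging it in is one possible design choice for keeping a high Love score from compensating for an unsafe response.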

Paper Structure

This paper contains 39 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of the RLUF pipeline from left to right: We begin with raw user-LLM conversations and binary feedback signals attached to each turn. We then train user signal reward models and combine them with existing reward models in a multi-objective reinforcement learning framework. This produces a user-aligned language model which has the desired property of improving user satisfaction.
  • Figure 2: Correlation between binary user feedback signals and 14-day retention. Love Reactions show the highest positive correlation with user retention. Thumbs up is positively correlated with retention, while thumbs down is negatively correlated with retention.
  • Figure 3: High correlation (0.95 Pearson) between average $P[\text{Love}]$ reward scores on a fixed prompt set and online Love Reaction rate during A/B testing. Numbers redacted.
  • Figure 4: We compare our two Love-optimized LLM candidates against the baseline LLM candidate. The aggressive LLM candidate increases the Love RM score more than the moderate candidate, but causes greater regression in helpfulness. Increasing the optimization budget for the Love reward leaves less budget for improving helpfulness and safety.
  • Figure 5: Change in Love Reaction rate split by Job To Be Done (JTBD) during A/B tests comparing each Love-optimized LLM candidate against the baseline LLM candidate. We find the greatest increases in Love Reaction rate in emotionally oriented categories.
  • ...and 2 more figures
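
The offline-to-online check described in the Figure 3 caption comes down to a Pearson correlation between two per-candidate series: the average reward score on a fixed prompt set and the Love Reaction rate observed in A/B testing. The sketch below shows that computation in plain Python. Because the paper's numbers are redacted, the two lists are synthetic placeholders, and the names `pearson_r`, `offline_love_scores`, and `online_love_rates` are assumptions made for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and centered squares (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic placeholder values, one entry per hypothetical model candidate.
# These are NOT the paper's data, which is redacted.
offline_love_scores = [0.10, 0.12, 0.15, 0.18, 0.20]
online_love_rates = [0.010, 0.013, 0.014, 0.019, 0.020]

r = pearson_r(offline_love_scores, online_love_rates)
print(f"Pearson r on synthetic data: {r:.3f}")
```

A coefficient close to 1 on real data would mean that ranking candidates by their offline reward score approximately preserves their ranking by online Love Reaction rate, which is the property that makes the reward model useful as an offline evaluator.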