Capturing Individual Human Preferences with Reward Features

André Barreto, Vincent Dumoulin, Yiran Mao, Nicolas Perez-Nieves, Bobak Shahriari, Yann Dauphin, Doina Precup, Hugo Larochelle

TL;DR

Reward models used in RLHF are typically trained on preferences pooled across many people, so they can fail to capture individual differences. This work introduces a reward-feature model (RFM) that expresses the reward of user $h$ for a response $y$ to a prompt $x$ as $r_h(x,y) = \langle \boldsymbol{\phi}(x,y), \boldsymbol{w}_h \rangle$, with pairwise preferences governed by $p(y \succ y'|x,h) = \sigma(\langle \boldsymbol{\phi}(x,y) - \boldsymbol{\phi}(x,y'), \boldsymbol{w}_h \rangle)$, where $\sigma$ is the logistic sigmoid. During training, the parameters $\boldsymbol{\theta}$ (defining the reward features $\boldsymbol{\phi}$) and $\boldsymbol{W}$ (rater-specific weights) are learned from data $D^+$; adaptation to a new user freezes $\boldsymbol{\theta}$ and optimises a new weight vector $\boldsymbol{w}$ on a small dataset $D_h$ via logistic regression. Across scenarios with varying levels of disagreement among raters, RFM can significantly outperform non-adaptive baselines and competing adaptive methods, and can match or exceed in-context personalisation, especially as disagreement increases. The adapted reward model can also be used to modulate actual LLM outputs through best-of-$n$ re-ranking. The work suggests that reward personalisation via shared features plus per-user adapters can robustly reflect minority viewpoints, with implications for safer and more inclusive RLHF systems across modalities.
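
The adaptation step described above can be illustrated with a minimal sketch. It assumes the features $\boldsymbol{\phi}(x,y)$ have already been computed by a frozen feature network and are available as NumPy arrays; the function names (`adapt_weights`, `best_of_n`), the plain gradient-descent optimiser, and its step size and iteration count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    """Logistic function used in the pairwise preference model."""
    return 1.0 / (1.0 + np.exp(-z))


def adapt_weights(phi_pref, phi_rej, steps=500, lr=0.5):
    """Fit a user weight vector w with the reward features held fixed.

    phi_pref, phi_rej: (m, d) arrays holding the features of the preferred
    and rejected response in each of the m comparisons in D_h.
    steps and lr are hypothetical optimiser settings chosen for this sketch.
    """
    diff = phi_pref - phi_rej              # phi(x, y) - phi(x, y') per pair
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = sigmoid(diff @ w)              # model probability of the observed choice
        # Gradient of the mean negative log-likelihood -mean(log p) w.r.t. w.
        grad = -(diff * (1.0 - p)[:, None]).mean(axis=0)
        w -= lr * grad
    return w


def best_of_n(phi_candidates, w):
    """Return the index of the candidate with the highest reward <phi, w>."""
    return int(np.argmax(phi_candidates @ w))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = np.array([1.0, -1.0, 0.0, 0.0])   # hypothetical held-out user
    a = rng.normal(size=(30, 4))               # synthetic stand-in features
    b = rng.normal(size=(30, 4))
    swap = (a - b) @ w_true < 0                # noise-free labels from w_true
    pref = np.where(swap[:, None], b, a)
    rej = np.where(swap[:, None], a, b)
    w_hat = adapt_weights(pref, rej)
    candidates = rng.normal(size=(8, 4))       # features of n = 8 responses
    print("adapted weights:", w_hat)
    print("best-of-n pick:", best_of_n(candidates, w_hat))
```

Because only the $d$-dimensional vector $\boldsymbol{w}$ is optimised, the problem is a convex logistic regression over feature differences, which is why a handful of comparisons can suffice in this sketch. The labels here are noise-free for simplicity; sampling them from the sigmoid would match the probabilistic model more closely.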

Abstract

Reinforcement learning from human feedback usually models preferences using a reward model that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We propose a method to specialise a reward model to a person or group of people. Our approach builds on the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models comparing the proposed architecture with a non-adaptive reward model and also adaptive counterparts, including models that do in-context personalisation. Depending on how much disagreement there is in the training data, our model either significantly outperforms the baselines or matches their performance with a simpler architecture and more stable training.

Paper Structure

This paper contains 27 sections, 15 equations, 7 figures.

Figures (7)

  • Figure 1: Intra-user test accuracy: accuracy in predicting the preferences of raters on test set during training in Scenario 1 (estimate of intra-user generalisation). Showing two runs per model.
  • Figure 2: Inter-user test accuracy: accuracy in predicting the preferences of held-out users on test set after adaptation (estimate of inter-user generalisation). Scenario 1: UltraFeedback features and raters sampled from normal distributions with one-hot means. Scenario 2: UltraFeedback features and raters sampled from normal distributions whose means are all permutations of the vector $[1, -1, 0, 0]^\top$. Scenario 3: Simple functions used as features and raters as in Scenario 2. The scale of the $y$-axis is defined per row. Error bars are $95\%$ confidence intervals over $10$ runs ($5$ adaptation runs on top of $2$ training runs).
  • Figure 3: Accuracy in predicting the preferences of held-out users on test set under Scenario 2. All models used $m=10$ examples for adaptation. The LLMs went over the test set once; for the non-adaptive baseline and RFM error bars are $95\%$ confidence intervals over $10$ runs ($5$ adaptation runs on top of $2$ training runs).
  • Figure 4: Fraction of times the best-of-$n$ response computed based on the scores of each model has the highest ground-truth score in Scenario 3. RFM used $m=30$ examples for adaptation. Statistics computed over $10$ runs ($5$ adaptation runs on top of $2$ training runs).
  • Figure 5: Inter-user test accuracy: accuracy in predicting the preferences of held-out "reward-models users" on test set after adaptation (estimate of inter-user generalisation). Error bars are $95\%$ confidence intervals over $10$ runs ($5$ adaptation runs on top of $2$ training runs).
  • ...and 2 more figures
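
The Scenario 2 rater population mentioned in the Figure 2 caption (normal distributions whose means are all permutations of $[1, -1, 0, 0]^\top$) can be sketched as follows. The standard deviation `sigma` and the number of raters drawn per mean are not given in the captions, so both are hypothetical parameters here, as are the function names.

```python
import itertools

import numpy as np


def scenario2_means():
    """All distinct permutations of [1, -1, 0, 0], one row per mean."""
    base = (1.0, -1.0, 0.0, 0.0)
    # set() drops the duplicates caused by the two identical zeros.
    return np.array(sorted(set(itertools.permutations(base))))


def sample_raters(means, raters_per_mean, sigma, rng):
    """Draw rater weight vectors w_h ~ N(mean, sigma^2 I) around each mean.

    raters_per_mean and sigma are assumptions made for this sketch.
    """
    d = means.shape[1]
    noise = rng.normal(0.0, sigma, size=(len(means), raters_per_mean, d))
    return (means[:, None, :] + noise).reshape(-1, d)


if __name__ == "__main__":
    means = scenario2_means()
    raters = sample_raters(means, raters_per_mean=3, sigma=0.1,
                           rng=np.random.default_rng(0))
    print(means.shape, raters.shape)
```

Each mean rewards one feature and penalises another, so raters drawn around different means disagree directly on some features, in contrast with the one-hot means of Scenario 1.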