Capturing Individual Human Preferences with Reward Features
André Barreto, Vincent Dumoulin, Yiran Mao, Nicolas Perez-Nieves, Bobak Shahriari, Yann Dauphin, Doina Precup, Hugo Larochelle
TL;DR
RLHF reward models trained on the pooled preferences of many raters can fail to capture individual differences. This work introduces a reward-feature model (RFM) that expresses the reward of user $h$ as $r_h(x,y) = \langle \boldsymbol{\phi}(x,y), \boldsymbol{w}_h \rangle$, with pairwise preferences governed by $p(y \succ y'|x,h) = \sigma(\langle \boldsymbol{\phi}(x,y) - \boldsymbol{\phi}(x,y'), \boldsymbol{w}_h \rangle)$. During training, the parameters $\boldsymbol{\theta}$ that define the reward features $\boldsymbol{\phi}$ and the rater-specific weights $\boldsymbol{W}$ are learned from data $D^+$; adapting to a new user freezes $\boldsymbol{\theta}$ and fits a new weight vector $\boldsymbol{w}$ to a small dataset $D_h$ via logistic regression. Across scenarios with varying levels of disagreement, RFM can significantly outperform non-adaptive baselines and competitive adaptive methods, and can match or exceed the performance of in-context personalisation, especially as disagreement increases. The adapted reward can also be used to steer actual LLM outputs through best-of-$n$ re-ranking. The results suggest that personalising rewards through shared features plus per-user weights can robustly reflect minority viewpoints, with implications for safer and more inclusive RLHF systems across modalities.
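The adaptation step and the best-of-$n$ re-ranking described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the frozen features $\boldsymbol{\phi}(x,y)$ are already available as arrays, fits $\boldsymbol{w}$ with plain gradient descent and a small L2 penalty, and uses synthetic data. The names `adapt_weights` and `best_of_n`, the hyperparameters, and the noise-free synthetic user are all hypothetical choices made for the example.

```python
import numpy as np


def sigmoid(z):
    # Logistic function written via tanh to avoid overflow for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def adapt_weights(phi_chosen, phi_rejected, steps=500, lr=0.5, l2=1e-3):
    """Fit one user's weight vector w on frozen reward features.

    phi_chosen[i] and phi_rejected[i] hold the features of the preferred and
    dispreferred response of pair i. This is logistic regression (without an
    intercept) on the feature differences, with every label equal to 1.
    """
    diff = phi_chosen - phi_rejected              # shape (n_pairs, d)
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = sigmoid(diff @ w)                     # model's P(chosen > rejected)
        # Gradient of the mean negative log-likelihood plus the L2 penalty.
        grad = -(diff.T @ (1.0 - p)) / len(diff) + l2 * w
        w -= lr * grad
    return w


def best_of_n(phi_candidates, w):
    """Return the index of the candidate with the highest reward <phi, w>."""
    return int(np.argmax(phi_candidates @ w))


# Synthetic check (assumption: a noise-free user with a known weight vector).
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5, 0.0])
phi_a = rng.normal(size=(200, 4))
phi_b = rng.normal(size=(200, 4))
a_preferred = (phi_a - phi_b) @ w_true > 0
phi_chosen = np.where(a_preferred[:, None], phi_a, phi_b)
phi_rejected = np.where(a_preferred[:, None], phi_b, phi_a)

w_hat = adapt_weights(phi_chosen, phi_rejected)
cosine = w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true))
print(f"cosine(w_hat, w_true) = {cosine:.3f}")
```

Given the feature rows of $n$ sampled responses to a prompt, `best_of_n(phi_candidates, w_hat)` picks the response that the adapted reward ranks highest. Only the direction of `w_hat` matters for that ranking, which is why the check above compares directions rather than magnitudes.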
Abstract
Reinforcement learning from human feedback usually models preferences using a reward model that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We propose a method to specialise a reward model to a person or group of people. Our approach builds on the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models comparing the proposed architecture with a non-adaptive reward model and also adaptive counterparts, including models that do in-context personalisation. Depending on how much disagreement there is in the training data, our model either significantly outperforms the baselines or matches their performance with a simpler architecture and more stable training.
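In the notation of the TL;DR, the two stages (learning the features, then adapting to an individual) could be written as the objectives below. This is a sketch that assumes both stages are fit by maximum likelihood under the logistic preference model. The tuple layout, in which $y$ is the response preferred over $y'$ and each example in $D^+$ carries a rater index $h$, is an assumption made here for illustration.

$$
\begin{aligned}
\text{Training:}\quad & \min_{\boldsymbol{\theta},\, \boldsymbol{W}} \; -\sum_{(h, x, y, y') \in D^+} \log \sigma\big(\langle \boldsymbol{\phi}_{\boldsymbol{\theta}}(x,y) - \boldsymbol{\phi}_{\boldsymbol{\theta}}(x,y'),\, \boldsymbol{w}_h \rangle\big) \\
\text{Adaptation:}\quad & \min_{\boldsymbol{w}} \; -\sum_{(x, y, y') \in D_h} \log \sigma\big(\langle \boldsymbol{\phi}_{\boldsymbol{\theta}}(x,y) - \boldsymbol{\phi}_{\boldsymbol{\theta}}(x,y'),\, \boldsymbol{w} \rangle\big), \qquad \boldsymbol{\theta} \text{ fixed}
\end{aligned}
$$

With $\boldsymbol{\theta}$ fixed, the adaptation objective is convex in $\boldsymbol{w}$, since it is a logistic regression on the feature differences. This is consistent with the abstract's claim that adaptation is quick.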
