Influencing Humans to Conform to Preference Models for RLHF

Stephane Hatgis-Kessell, W. Bradley Knox, Serena Booth, Scott Niekum, Peter Stone

TL;DR

This paper tackles misalignment between RLHF's assumed human-preference model and actual human behavior by proposing three intervention strategies to steer how people express preferences toward a chosen model (partial return or regret) without altering the underlying reward. It demonstrates, across privileged, trained, and question-based interventions, that human preference data can be significantly shaped to better conform to a specified model, improving the learned reward function via standard RLHF objectives. The work introduces a novel direction in model alignment: designing interfaces and training to align human input with the modeling assumptions of the learning algorithm, offering practical tools to enhance data quality and alignment. The findings have practical implications for RLHF practitioners and open avenues for extending interface-driven alignment to more complex and real-world domains.

Abstract

Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human's unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human's reward function. In this paper, we conduct three human studies to assess whether one can influence the expression of real human preferences to more closely conform to a desired preference model. Importantly, our approach does not seek to alter the human's unobserved reward function. Rather, we change how humans use this reward function to generate preferences, such that they better match whatever preference model is assumed by a particular RLHF algorithm. We introduce three interventions: showing humans the quantities that underlie a preference model, which is normally unobservable information derived from the reward function; training people to follow a specific preference model; and modifying the preference elicitation question. All intervention types show significant effects, providing practical tools to improve preference data quality and the resultant alignment of the learned reward functions. Overall, we establish a novel research direction in model alignment: designing interfaces and training interventions to increase human conformance with the modeling assumptions of the algorithm that will learn from their input.

Paper Structure (45 sections, 4 equations, 27 figures, 6 tables)

Figures (27)

  • Figure 1: Our proposed method of influencing human preferences. We design interfaces to influence the human's preferences without changing their underlying reward function.
  • Figure 2: On each step, the vehicle receives reward as the sum of reward components: $-1$ for every time it moves; $+1$ for collecting a coin; and $+50$ for reaching the red goal marker, which ends the episode. The partial return preference model favors the left trajectory, while regret favors the right.
  • Figure 3: The delivery task shown to human subjects for gathering preferences. The yellow vehicle is the agent, and its objective is to maximize its score. Score maximization requires reaching the red inverted teardrop.
  • Figure 4: The baseline preference elicitation interface shown to human annotators. All three of our experiments involve changes to this interface, whether by adding additional metrics (privileged experiment), training the human before eliciting preferences (trained experiment), or changing the elicitation question (question experiment).
  • Figure 5: For the privileged experiment, the mean cross-entropy loss over each condition's preference dataset with respect to the target preference model. If the loss is lower for an intervention's dataset than for the Privileged-Control dataset, then the former is better predicted by the target preference model. Performing a Mann-Whitney U test results in $p<0.01$ for both conditions.
  • ...and 22 more figures
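
The quantities named in the Figure 2 and Figure 5 captions can be illustrated with a small sketch. It is not the authors' implementation. The reward components (-1 per move, +1 per coin, +50 for the goal) come from the Figure 2 caption; everything else is an assumption made for illustration: the logistic (Bradley-Terry style) link between segment scores and preference probabilities, the deterministic form of regret, and all function and parameter names. The optimal state values passed to the regret function are hypothetical inputs that would have to come from solving the task.

```python
import math

# Reward components as stated in the Figure 2 caption.
MOVE_REWARD = -1.0
COIN_REWARD = 1.0
GOAL_REWARD = 50.0


def partial_return(moves, coins, reached_goal):
    """Undiscounted sum of rewards over one trajectory segment."""
    goal_bonus = GOAL_REWARD if reached_goal else 0.0
    return moves * MOVE_REWARD + coins * COIN_REWARD + goal_bonus


def deterministic_regret(v_start, segment_return, v_end):
    """Assumed regret form for deterministic transitions:
    optimal value at the segment start minus (segment return + optimal
    value at the segment end). v_start and v_end are hypothetical inputs."""
    return v_start - (segment_return + v_end)


def preference_prob(score_1, score_2):
    """Assumed logistic link: probability that segment 1 is preferred,
    given one score per segment (higher is better). Numerically stable."""
    diff = score_1 - score_2
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def mean_cross_entropy(labels, probs, eps=1e-12):
    """Mean cross-entropy of preference labels under a preference model.
    labels[i] is 1.0 if segment 1 was preferred, 0.0 if segment 2 was;
    probs[i] is the model's probability that segment 1 is preferred."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(labels)


# Hypothetical pair of segments (not taken from the paper's figures).
left = partial_return(moves=3, coins=1, reached_goal=False)
right = partial_return(moves=3, coins=0, reached_goal=False)

# Partial return model: score each segment by its return.
p_partial = preference_prob(left, right)

# Regret model: lower regret is better, so negate it to get a score.
# The optimal values 40.0, 41.0 and 45.0 are made up for illustration.
regret_left = deterministic_regret(v_start=40.0, segment_return=left, v_end=41.0)
regret_right = deterministic_regret(v_start=40.0, segment_return=right, v_end=45.0)
p_regret = preference_prob(-regret_left, -regret_right)

# Loss of one observed label (segment 1 preferred) under each model.
loss_partial = mean_cross_entropy([1.0], [p_partial])
loss_regret = mean_cross_entropy([1.0], [p_regret])
```

With these made-up values, the partial return model leans toward the first segment and the regret model toward the second, which mirrors the kind of disagreement described in the Figure 2 caption. A lower mean cross-entropy for a dataset means the target preference model predicts that dataset's labels better, which is the comparison reported in Figure 5.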