Should I State or Should I Show? Aligning AI with Human Preferences

Keaton Ellis, Wanying Huang

Abstract

As AI agents become more autonomous, properly aligning their objectives with human preferences becomes increasingly important. We study how effectively an AI agent learns a human principal's preferences over risky choices via stated versus revealed preferences. We conduct an online experiment in which subjects state their preferences through written instructions ("prompts") and reveal them through choices in a series of binary lottery questions ("data"). We find that, on average, an AI agent given revealed-preference data predicts subjects' choices more accurately than an AI agent given stated-preference prompts. Further analysis suggests that this gap is driven by subjects' difficulty in translating their own preferences into written instructions. When asked which information source to give to an AI agent, a large fraction of subjects fail to select the more informative one. Moreover, when predictions from the two sources conflict, we find that the AI agent aligns more frequently with the prompt, despite the prompt's lower accuracy. Overall, these results highlight the revealed-preference approach as a powerful mechanism for communicating human preferences to AI agents, but its success depends on careful implementation.

Figures (14)

  • Figure 1: Comparison of Prompt-AI and Data-AI match rates ($N = 147$). Panel (a): Empirical CDF of per-subject match rates for Prompt-AI (blue) and Data-AI (green); the dashed vertical line marks 50%. Panel (b): Mean match rates by question category (Easy, Behavioral, Hard); the dashed horizontal line marks 50%. Error bars show 95% confidence intervals with standard errors clustered at the subject level. Stars denote paired $t$-tests comparing Data-AI to Prompt-AI within each category. $^{*}p<0.10$, $^{**}p<0.05$, $^{***}p<0.01$. (A sketch of this paired comparison appears after the figure list.)
  • Figure 2: Mean match rates by a subject's number of behavioral effects observed in the Behavioral questions; the dashed horizontal line marks 50%. Error bars show 95% confidence intervals with standard errors clustered at the subject level. Stars denote paired $t$-tests comparing Data-AI to Prompt-AI within each group of subjects. $^{*}p<0.10$, $^{**}p<0.05$, $^{***}p<0.01$.
  • Figure 3: Comparison of Prompt-AI, Data-AI, and AutoPrompt-AI match rates ($N = 147$). AutoPrompt-AI (purple) uses a preference description auto-generated by Claude from the subject's Part I choices. Error bars show 95% confidence intervals with standard errors clustered at the subject level. Stars denote paired $t$-tests comparing AI agents. $^{*}p<0.10$, $^{**}p<0.05$, $^{***}p<0.01$.
  • Figure 4: Delegation choice and accuracy ($N = 147$). Panel (a): Mean Prompt-AI (blue) and Data-AI (green) match rates conditional on each subject's delegation choice. Panel (b): Fraction of subjects in each delegation group who chose the ex-post better agent (green), were tied (grey), or chose the ex-post worse agent (salmon). Error bars show 95% confidence intervals with standard errors clustered at the subject level. Stars denote paired $t$-tests comparing Prompt-AI to Data-AI within each delegation group or $t$-tests comparing an AI agent's match rate across groups. $^{*}p<0.10$, $^{**}p<0.05$, $^{***}p<0.01$.
  • Figure 5: Mean believed and actual Data-AI advantage by delegation group ($N = 147$). Lighter bars show the mean believed advantage (guessed Data-AI match rate minus guessed Prompt-AI match rate); darker bars show the actual realized advantage. Error bars show 95% confidence intervals with standard errors clustered at the subject level. Within-group brackets report paired $t$-tests comparing believed to actual advantage; the lower bracket compares the actual Data-AI advantage across delegation groups. $^{*}p<0.10$, $^{**}p<0.05$, $^{***}p<0.01$.
  • ...and 9 more figures
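
The per-subject match rates referenced throughout these captions lend themselves to a simple paired comparison. Below is a minimal sketch, not the authors' code, of how such a test could be run; the file name and column names (`subject`, `prompt_match`, `data_match`) are hypothetical. Collapsing to one match rate per subject before testing is one simple way to respect the subject-level clustering the captions describe.

```python
# Sketch of the paired comparison described in the figure captions:
# per-subject match rates for the two AI agents, compared with a
# paired t-test. All names below are illustrative assumptions.
import pandas as pd
from scipy import stats

# Hypothetical long-format data: one row per subject-question pair,
# with 1/0 indicators for whether each agent's prediction matched
# the subject's actual choice.
df = pd.read_csv("predictions.csv")  # columns: subject, question, prompt_match, data_match

# Collapse to one match rate per subject per agent. Averaging within
# subject before testing accounts for the subject-level clustering.
per_subject = df.groupby("subject")[["prompt_match", "data_match"]].mean()

# Paired t-test of Data-AI vs. Prompt-AI match rates across subjects.
t_stat, p_value = stats.ttest_rel(per_subject["data_match"], per_subject["prompt_match"])

advantage = (per_subject["data_match"] - per_subject["prompt_match"]).mean()
print(f"mean Data-AI advantage = {advantage:.3f}")
print(f"paired t = {t_stat:.2f}, p = {p_value:.3f}")
```

Within-category comparisons (Easy, Behavioral, Hard) would follow the same pattern after filtering the rows to one question category at a time.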