Mapping Social Choice Theory to RLHF

Jessica Dai; Eve Fleisig

TL;DR

The paper investigates how social choice theory can inform RLHF by formalizing an RLHF-like setting in which world-driven preferences are parameterized by $\theta$, text is represented with fixed features, and human feedback follows a Bradley–Terry–Luce model. It synthesizes perspectives from related work (notably Zhu 2023 and Cassidy & Anand) to contrast learning the full preference model $\theta$ with modeling a downstream utility $u$, and it discusses how annotator heterogeneity, annotation distortions, and sampling choices affect downstream policies. The analysis identifies key assumptions that RLHF often makes more strongly than social choice theory would, and highlights potential failure modes, such as non-IID noise and context effects, that can distort preference aggregation. Overall, it offers a roadmap for robust RLHF design by examining sampling, uncertainty in the preference model, and the role of hidden context in downstream decision-making, with practical implications for constructing and interpreting human judgments in real-world systems.
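
As a minimal sketch of the feedback model named above, assume a linear score over a fixed feature map $\phi$. The symbols $\phi$, $y_1$, and $y_2$ and the linear form are introduced here for illustration and may differ from the paper's own notation. Under that assumption, the Bradley–Terry–Luce model gives the probability that an annotator prefers text $y_1$ to text $y_2$ as:

$$
P(y_1 \succ y_2 \mid \theta)
= \frac{\exp\!\big(\theta^\top \phi(y_1)\big)}{\exp\!\big(\theta^\top \phi(y_1)\big) + \exp\!\big(\theta^\top \phi(y_2)\big)}
= \sigma\!\big(\theta^\top (\phi(y_1) - \phi(y_2))\big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}.
$$

In this form only the feature difference $\phi(y_1) - \phi(y_2)$ enters the comparison, so two texts with identical features are each preferred with probability $1/2$.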

Abstract

Recent work on the limitations of using reinforcement learning from human feedback (RLHF) to incorporate human preferences into model behavior often raises social choice theory as a reference point. Social choice theory's analysis of settings such as voting mechanisms provides technical infrastructure that can inform how to aggregate human preferences amid disagreement. We analyze the problem settings of social choice and RLHF, identify key differences between them, and discuss how these differences may affect the RLHF interpretation of well-known technical results in social choice.

Paper Structure (11 sections, 16 equations)


Model
Preferences
Text
Data generation
Learning
Our Questions
Datasets
Related Work
Eve's notes on trying to combine ideas from both papers
Stronger assumptions made by RLHF
Things that can go wrong
