Mapping Social Choice Theory to RLHF
Jessica Dai, Eve Fleisig
TL;DR
The paper investigates how social choice theory can inform RLHF by formalizing an RLHF-like setting in which world-driven preferences are parameterized by $\theta$, text is represented with fixed features, and human feedback follows a Bradley–Terry–Luce model. It synthesizes perspectives from related work (notably Zhu 2023 and Cassidy & Anand) to contrast learning the full preference model $\theta$ with modeling a downstream utility $u$, and it discusses how annotator heterogeneity, annotation distortions, and sampling choices affect downstream policies. The analysis identifies key assumptions that RLHF often makes more strongly than social choice theory would, and highlights potential failure modes, such as non-IID noise and context effects, that can distort preference aggregation. Overall, it offers a roadmap for robust RLHF design by examining sampling, uncertainty in the preference model, and the role of hidden context in downstream decision-making, with practical implications for how human judgments are collected and interpreted in real-world systems.
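As a rough illustration of the setting summarized above (a sketch under stated assumptions, not necessarily the paper's exact formulation): suppose a feature map $\phi$ sends each text $x$ to a fixed vector, and the reward is linear in those features, $r_\theta(x) = \theta^\top \phi(x)$. The symbols $\phi$, $r_\theta$, and $\sigma$, as well as the linear form, are introduced here for illustration only. Under a Bradley–Terry–Luce model, the probability that an annotator prefers $x_a$ to $x_b$ would then be

$$
P(x_a \succ x_b) = \frac{\exp\big(\theta^\top \phi(x_a)\big)}{\exp\big(\theta^\top \phi(x_a)\big) + \exp\big(\theta^\top \phi(x_b)\big)} = \sigma\big(\theta^\top (\phi(x_a) - \phi(x_b))\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.
$$

For instance, a reward gap of $\theta^\top(\phi(x_a) - \phi(x_b)) = 1$ gives $P(x_a \succ x_b) = 1/(1 + e^{-1}) \approx 0.73$, while a gap of $0$ gives exactly $0.5$. Only the reward difference matters, so adding a constant to every reward leaves the preference probabilities unchanged.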
Abstract
Recent work on the limitations of using reinforcement learning from human feedback (RLHF) to incorporate human preferences into model behavior often raises social choice theory as a reference point. Social choice theory's analysis of settings such as voting mechanisms provides technical infrastructure that can inform how to aggregate human preferences amid disagreement. We analyze the problem settings of social choice and RLHF, identify key differences between them, and discuss how these differences may affect the RLHF interpretation of well-known technical results in social choice.
