RLHF and IIA: Perverse Incentives
Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy
TL;DR
The paper examines RLHF methods built on choice models that assume independence of irrelevant alternatives (IIA) and argues that human preferences over language violate IIA, which creates perverse incentives when learning rewards and tuning policies. Through theoretical analysis, simulations with a dichotomy data model, and an empirical study using PaLM2 with GPT-3.5/4-generated data, it shows that RLHF algorithms such as RLPO, DPO, IL, and SLiC can systematically misrepresent preferences as the number of alternatives per query grows. Even seemingly innocuous changes to the query format can invert which messages are favored, exposing a fundamental mismatch between IIA-based models and real user preferences. The authors call for new RLHF foundations and methods that do not rely on IIA, so that innovations in query formats and learning algorithms remain robust under real-world textual preferences.
Abstract
Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives induced by IIA hinder innovations in query formats and learning algorithms.
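To make the IIA assumption concrete, the following is a minimal sketch, assuming a softmax (multinomial logit) choice model, which is a standard example of a model that satisfies IIA. It is an illustration only and not the authors' code; the function name `mnl_choice_probs` and the reward values are hypothetical. Under such a model, the odds of choosing message a over message b stay fixed when a near-duplicate of b is added to the query, even though a's overall chance of being chosen drops.

```python
import math


def mnl_choice_probs(rewards):
    """Softmax (multinomial logit) choice probabilities over a list of rewards."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(r - m) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical rewards for two messages, a and b (illustrative values only).
r_a, r_b = 1.0, 0.0

# Query with two alternatives: a and b.
p2 = mnl_choice_probs([r_a, r_b])

# Query with three alternatives: a, b, and a near-duplicate of b (same reward).
p3 = mnl_choice_probs([r_a, r_b, r_b])

# Under IIA, the odds of a versus b do not depend on the other alternatives.
ratio_two = p2[0] / p2[1]
ratio_three = p3[0] / p3[1]

print(f"odds a:b with 2 alternatives: {ratio_two:.4f}")
print(f"odds a:b with 3 alternatives: {ratio_three:.4f}")
print(f"P(a) with 2 alternatives: {p2[0]:.4f}")
print(f"P(a) with 3 alternatives: {p3[0]:.4f}")
```

The sketch shows only the model-side property: the a:b odds are identical in both queries, while the probability of choosing a falls once the duplicate is present. If real human choices do not behave this way, as the paper argues, a model with this property will misstate preferences when the number of alternatives per query changes.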
