The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity
Ali Aouad, Aymane El Gadarri, Vivek F. Farias
TL;DR
The paper tackles learning population-average utilities under user heterogeneity for LLM alignment, highlighting that standard RLHF cross-entropy estimation misweights individuals and distorts social welfare. It introduces the Sign estimator, which achieves ordinal consistency under symmetry and, for linear utility models, recovers the population direction $\bar{\beta}/\Vert\bar{\beta}\Vert$ with a cube-root finite-sample rate $\tilde{O}(n^{-1/3})$. The method avoids full mixture estimation and maintains compatibility with existing RLHF pipelines by replacing cross-entropy with a 0-1 loss proxy, enabling provable consistency and practical gains. Empirical results on digital-twin–driven preference data show substantial reductions in angular error (about 35%) and disagreement (about 40%) compared to RLHF, while outperforming EM-based panel methods in both accuracy and simplicity. The work provides a practical, theoretically justified alternative for robust aggregation of heterogeneous preferences in reward learning for language models.
Abstract
Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences. Fitting a naïve probabilistic model to pairwise comparison data (say over prompt-completion pairs) yields an inconsistent estimate of the population-average utility, a canonical measure of social welfare. We propose a new method, dubbed the sign estimator, that provides a simple, provably consistent, and efficient estimator by replacing cross-entropy with binary classification loss in the aggregation step. This simple modification recovers consistent ordinal alignment under mild assumptions and achieves the first polynomial finite-sample error bounds in this setting. In realistic simulations of LLM alignment using digital twins, the sign estimator substantially reduces preference distortion over a panel of simulated personas, cutting (angular) estimation error by nearly 35% and decreasing disagreement with true population preferences from 12% to 8% compared to standard RLHF. Our method also compares favorably to panel data heuristics that explicitly model user heterogeneity and require tracking individual-level preference data, all while maintaining the implementation simplicity of existing LLM alignment pipelines.
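
To make the idea of swapping cross-entropy for a binary classification (0-1) loss concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a two-dimensional linear utility model, a hypothetical population made of two equally likely types $\bar{\beta} \pm \delta$ (so that it is symmetric around $\bar{\beta}$), logit choices, and Gaussian feature differences. Because the 0-1 loss is not differentiable, the sketch minimizes it by brute-force grid search over the unit circle, which is only practical in two dimensions. All function names and parameter values are chosen here for illustration.

```python
import numpy as np


def simulate_comparisons(n, beta_bar, delta, rng):
    """Draw n pairwise comparisons from a symmetric two-type population.

    Each row of z is the feature difference between two completions.
    Every comparison is answered by a user of type beta_bar + delta or
    beta_bar - delta (equal probability) through a logit choice model.
    These modelling choices are assumptions of this sketch.
    """
    z = rng.standard_normal((n, beta_bar.size))
    type_sign = rng.choice([-1.0, 1.0], size=n)
    betas = beta_bar + type_sign[:, None] * delta
    utility_gap = np.einsum("ij,ij->i", betas, z)
    p_first = 1.0 / (1.0 + np.exp(-utility_gap))
    # Label +1 if the first completion is preferred, -1 otherwise.
    labels = np.where(rng.random(n) < p_first, 1.0, -1.0)
    return z, labels


def zero_one_direction(z, labels, num_angles=720):
    """Return the unit vector on a grid that minimizes the empirical 0-1 loss."""
    angles = np.linspace(0.0, 2.0 * np.pi, num_angles, endpoint=False)
    best_theta, best_loss = None, np.inf
    for angle in angles:
        theta = np.array([np.cos(angle), np.sin(angle)])
        # Fraction of comparisons whose observed winner disagrees with theta.
        loss = np.mean(np.sign(z @ theta) != labels)
        if loss < best_loss:
            best_theta, best_loss = theta, loss
    return best_theta, best_loss


def angular_error(theta, beta_bar):
    """Angle in radians between the estimate and the population direction."""
    cosine = theta @ beta_bar / (np.linalg.norm(theta) * np.linalg.norm(beta_bar))
    return float(np.arccos(np.clip(cosine, -1.0, 1.0)))


rng = np.random.default_rng(0)
beta_bar = np.array([2.0, 1.0])   # hypothetical population-average utility
delta = np.array([-0.5, 1.0])     # hypothetical heterogeneity direction
z, labels = simulate_comparisons(50_000, beta_bar, delta, rng)
theta_hat, loss = zero_one_direction(z, labels)
print("angular error (rad):", angular_error(theta_hat, beta_bar))
print("empirical 0-1 loss:", loss)
```

In this toy setup the majority preference on any pair agrees with the sign of $\bar{\beta}^\top z$, so the direction minimizing the 0-1 loss should approach $\bar{\beta}/\Vert\bar{\beta}\Vert$ as the sample grows. Only the direction is recovered, not the scale, since the 0-1 loss is invariant to rescaling of the parameter.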
