On Diversified Preferences of Large Language Model Alignment
Dun Zeng, Yong Dai, Pengyu Cheng, Longyue Wang, Tianhao Hu, Wanshun Chen, Nan Du, Zenglin Xu
TL;DR
This work analyzes how diversified human preferences influence reward-model-based alignment of large language models, demonstrating that model and data scale modulate the impact of preference diversity. It introduces Expected Calibration Error (ECE) as a key metric for RM reliability and reveals a positive link between RM calibration and LLM alignment quality. To address reward drift caused by mixed preferences, the authors propose MORE, a Multi-Objective Reward training scheme that adaptively reweights per-dataset losses to emphasize shared preferences. Experimental results across multiple datasets and base models show that MORE improves RM calibration and enhances RJS-based alignment, with calibration serving as a predictor of downstream performance. The study highlights the importance of data diversity, calibration metrics, and drift-mitigation strategies for robust, practical LLM alignment in diverse, real-world settings.
Abstract
Aligning large language models (LLMs) with human preferences has been recognized as the key to improving LLMs' interaction quality. However, in this pluralistic world, human preferences can be diversified due to annotators' different tastes, which hinders the effectiveness of LLM alignment methods. This paper presents the first quantitative analysis of the experimental scaling law for reward models (RMs) of varying sizes, from 1.3 billion to 7 billion parameters, trained with human feedback exhibiting diverse preferences. Our analysis reveals that the impact of diversified human preferences depends on both model size and data size. Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them. To address the impact of diverse preferences, we introduce Expected Calibration Error (ECE) as a new metric for evaluating RMs and show that RM calibration is clearly and positively correlated with the alignment performance of LLMs. Furthermore, we propose a Multi-Objective Reward learning method (MORE) to enhance the calibration performance of RMs on shared preferences. Through experiments on four models and five human preference datasets, we find that calibration error can be adopted as a key metric for evaluating RMs and that MORE achieves superior alignment performance.
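The abstract names ECE but does not spell out how it is computed for a reward model. The following is a minimal sketch of the standard binned definition, under two assumptions that are not stated in the text above: the RM's preference probability follows the Bradley-Terry form sigmoid(r_chosen - r_rejected), and confidence is binned uniformly over [0.5, 1]. The function names `stable_sigmoid` and `preference_ece` are hypothetical and chosen for illustration only.

```python
import math


def stable_sigmoid(x):
    """Logistic function that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def preference_ece(chosen_rewards, rejected_rewards, n_bins=10):
    """Binned ECE of a reward model on preference pairs.

    chosen_rewards[i] is the RM score of the human-preferred response,
    rejected_rewards[i] the score of the other response.
    Assumption: P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("reward lists must have the same length")
    n = len(chosen_rewards)
    if n == 0:
        raise ValueError("need at least one preference pair")

    # Per bin: [pair count, summed confidence, number of correct predictions]
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for r_c, r_r in zip(chosen_rewards, rejected_rewards):
        p = stable_sigmoid(r_c - r_r)
        confidence = max(p, 1.0 - p)      # confidence in the predicted winner
        correct = 1 if p > 0.5 else 0     # ties count as incorrect
        # Confidence lies in [0.5, 1]; map it to one of n_bins uniform bins.
        idx = min(int((confidence - 0.5) / 0.5 * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += confidence
        bins[idx][2] += correct

    # Weighted average of |accuracy - mean confidence| over non-empty bins.
    ece = 0.0
    for count, conf_sum, n_correct in bins:
        if count > 0:
            ece += (count / n) * abs(n_correct / count - conf_sum / count)
    return ece
```

A lower value means the RM's stated confidence tracks its actual accuracy more closely. For example, `preference_ece([2.0, 0.0], [0.0, 2.0])` scores one pair correctly and one incorrectly with the same high confidence, so the gap between accuracy (0.5) and confidence shows up directly as calibration error.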
