On Diversified Preferences of Large Language Model Alignment

Dun Zeng, Yong Dai, Pengyu Cheng, Longyue Wang, Tianhao Hu, Wanshun Chen, Nan Du, Zenglin Xu

TL;DR

This work analyzes how diversified human preferences influence reward-model-based alignment of large language models, demonstrating that model and data scale modulate the impact of preference diversity. It introduces Expected Calibration Error (ECE) as a key metric for RM reliability and reveals a positive link between RM calibration and LLM alignment quality. To address reward drift caused by mixed preferences, the authors propose MORE, a Multi-Objective Reward training scheme that adaptively reweights per-dataset losses to emphasize shared preferences. Experimental results across multiple datasets and base models show that MORE improves RM calibration and enhances RJS-based alignment, with calibration serving as a predictor of downstream performance. The study highlights the importance of data diversity, calibration metrics, and drift-mitigation strategies for robust, practical LLM alignment in diverse, real-world settings.

Abstract

Aligning large language models (LLMs) with human preferences has been recognized as the key to improving LLMs' interaction quality. However, in this pluralistic world, human preferences can be diversified due to annotators' different tastes, which hinders the effectiveness of LLM alignment methods. This paper presents the first quantitative analysis of the experimental scaling law for reward models (RMs) of varying sizes, from 1.3 billion to 7 billion parameters, trained with human feedback exhibiting diverse preferences. Our analysis reveals that the impact of diversified human preferences depends on both model size and data size. Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them. To evaluate RMs under diverse preferences, we introduce Expected Calibration Error (ECE) as a metric and show that RM calibration is clearly and positively correlated with the alignment performance of LLMs. Furthermore, we propose a Multi-Objective Reward learning method (MORE) to enhance the calibration performance of RMs on shared preferences. Through experiments on four models and five human preference datasets, we find that calibration error can serve as a key metric for evaluating RMs and that MORE achieves superior alignment performance.
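
To make the calibration metric concrete, the sketch below computes a standard binned ECE for a pairwise reward model. It is an illustration under stated assumptions, not the authors' evaluation code: it assumes the RM's preference probability follows the Bradley-Terry form $\sigma(r_{\text{chosen}} - r_{\text{rejected}})$, takes the confidence to be the larger of the two class probabilities, and uses equal-width bins. The function names and the default of 10 bins are hypothetical choices made for this example.

```python
import math


def preference_probability(r_chosen, r_rejected):
    """Bradley-Terry probability that the human-chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def expected_calibration_error(reward_pairs, n_bins=10):
    """Binned ECE over (reward_chosen, reward_rejected) pairs.

    ECE = sum over bins of (bin size / n) * |accuracy(bin) - mean confidence(bin)|.
    """
    bins = [[] for _ in range(n_bins)]
    for r_chosen, r_rejected in reward_pairs:
        p = preference_probability(r_chosen, r_rejected)
        confidence = max(p, 1.0 - p)        # confidence in the predicted winner
        correct = 1.0 if p > 0.5 else 0.0   # ties (p == 0.5) count as incorrect
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes to the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))

    n = len(reward_pairs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(k for _, k in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_confidence)
    return ece


if __name__ == "__main__":
    # Toy rewards: one pair ranked correctly, one ranked incorrectly, both with high confidence.
    toy_pairs = [(2.0, 0.0), (0.0, 2.0)]
    print(expected_calibration_error(toy_pairs))
```

In the toy input, the RM is equally confident on both pairs but right on only one of them, so the gap between confidence and accuracy shows up directly as calibration error even though a plain accuracy number would hide it.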

Paper Structure

This paper contains 46 sections, 14 equations, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Illustration of Diversified Preferences. Left: reward accuracy on each preference. Middle: the reward distribution of each RM on the harmless preference. Right: the reward statistics of each RM on the harmless preference. The solid box shows the reward statistics on correctly rewarded samples, and the hollow box shows those on incorrectly rewarded samples.
  • Figure 1: The RJS alignment performance with different RMs. The first line is the performance of the Alpaca base model. The results show that ECE further reflects the ability of RMs when the reward accuracy is close.
  • Figure 2: Multi-objective reward model training scheme (MORE), which consists of four steps: (1) collect a diversified batch of data from the mixed dataset; (2) calculate the RM gradient for each preference source; (3) minimize the reward drift to determine the scalar weights $(\lambda_1, \lambda_2, \dots, \lambda_K)$ for the MORE loss; (4) update the RM with the re-weighted RM loss. A lower calibration error indicates that the RM provides more accurate rewards.
  • Figure 3: The reward accuracy of RMs with different training schemes on each dataset.
  • Figure 4: The ECE of the corresponding RMs.
  • ...and 5 more figures
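
The four steps in the MORE caption (Figure 2) can be sketched on toy data as follows. This is a hedged illustration, not the authors' implementation: the summary above does not give the exact reward-drift objective, so the sketch substitutes a common multi-objective choice, the min-norm problem $\min_{\lambda \in \Delta} \lVert \sum_k \lambda_k g_k \rVert^2$ over the probability simplex, solved with Frank-Wolfe as in MGDA-style methods. The names `min_norm_weights` and `reweighted_loss`, and the use of short Python lists in place of flattened per-source RM gradients, are assumptions made for the example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_weights(grads, n_iters=100):
    """Frank-Wolfe solver for min over the simplex of ||sum_k lam_k * g_k||^2.

    `grads` holds one (flattened) gradient vector per preference source.
    """
    k = len(grads)
    gram = [[dot(gi, gj) for gj in grads] for gi in grads]
    lam = [1.0 / k] * k  # start from uniform weights
    for _ in range(n_iters):
        # Up to a factor of 2, the gradient of the objective w.r.t. lam is gram @ lam.
        scores = [dot(row, lam) for row in gram]
        t = min(range(k), key=scores.__getitem__)  # best simplex vertex e_t
        # Exact line search along d = e_t - lam.
        lam_g_lam = dot(lam, scores)
        denom = lam_g_lam - 2.0 * scores[t] + gram[t][t]  # d^T G d
        if denom <= 1e-12:
            break
        gamma = max(0.0, min(1.0, (lam_g_lam - scores[t]) / denom))
        if gamma == 0.0:
            break  # no descent direction left: lam is optimal
        lam = [(1.0 - gamma) * w for w in lam]
        lam[t] += gamma
    return lam


def reweighted_loss(weights, losses):
    """Step (4): combine per-source RM losses with the solved weights."""
    return sum(w * loss for w, loss in zip(weights, losses))


if __name__ == "__main__":
    # Steps (1)-(2), faked: per-source gradients and losses from one mixed batch.
    per_source_grads = [[1.0, 0.0], [3.0, 0.0]]
    per_source_losses = [0.7, 0.4]
    lam = min_norm_weights(per_source_grads)         # step (3)
    print(lam, reweighted_loss(lam, per_source_losses))
```

With real models, each entry of `per_source_grads` would be the gradient of one preference source's RM loss with respect to the shared parameters. The min-norm weighting tends to down-weight sources whose gradients dominate the update, which is one plausible reading of "emphasizing shared preferences".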