Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison

Judy Hanwen Shen, Archit Sharma, Jun Qin

TL;DR

This work systematically studies preference datasets from three perspectives (scale, label noise, and information content), proposes specific metrics for each, and uncovers different axes of comparison for a better understanding of preference datasets.

Abstract

The goal of aligning language models to human preferences requires data that reveal these preferences. Ideally, time and money can be spent carefully collecting and tailoring bespoke preference data to each downstream application. However, in practice, a select few publicly available preference datasets are often used to train reward models for reinforcement learning from human feedback (RLHF). While new preference datasets are being introduced with increasing frequency, there are currently no existing efforts to measure and compare these datasets. In this paper, we systematically study preference datasets through three perspectives: scale, label noise, and information content. We propose specific metrics for each of these perspectives and uncover different axes of comparison for a better understanding of preference datasets. Our work is a first step towards a data-centric approach to alignment by providing perspectives that aid in training efficiency and iterative data collection for RLHF.

Paper Structure (21 sections, 10 equations, 16 figures, 1 table)


Figures (16)

  • Figure 1: Scaling behavior when measuring evaluation set accuracy is dataset dependent.
  • Figure 2: Comparing RewardBench performance across different datasets for the Llama2-7B-chat model. Increasing the dataset size does not improve performance for most datasets on most tasks.
  • Figure 3: Empirical CDF of $P(y_w \succ y_l)$ for different datasets at different noise levels for Llama 7B on RewardBench. When there is no noise, some datasets induce a more confident distribution even with the same number of training examples. As more noise is added, all probabilities shift towards $0.5$ and the datasets become indistinguishable.
  • Figure 4: The impact of noise on reward model confidence $P(y_w \succ y_l)$ on UltraFeedback for RewardBench. As the noise rate (% of flipped labels) increases, the probability of the winning response being chosen concentrates around $0.5$. This phenomenon appears across all models and datasets, to varying extents.
  • Figure 5: (Left) Distribution of cosine similarity of response pairs for different datasets. The HH-RLHF dataset contains far more similar response pairs $(y_w, y_l)$ than the UltraFeedback dataset. (Right) Evaluation set accuracy when training different models on "high information" (low response similarity) data compared to a random sample. The benefits of "high information" data are most salient in the smallest model.
  • ...and 11 more figures
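
The quantities named in the captions above (the noise rate as a fraction of flipped labels, the reward model confidence $P(y_w \succ y_l)$, and the cosine similarity of a response pair) can be illustrated with a minimal sketch. This is not the authors' code. It assumes a Bradley-Terry style reward model, where confidence is the sigmoid of the reward gap, and it assumes responses have already been embedded as plain vectors; the function names `preference_probability`, `flip_labels`, and `cosine_similarity` are hypothetical and chosen only for illustration.

```python
import math
import random


def preference_probability(reward_w, reward_l):
    """P(y_w > y_l) under an ASSUMED Bradley-Terry model: sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(reward_w - reward_l)))


def flip_labels(pairs, noise_rate, seed=0):
    """Swap (chosen, rejected) for a fixed fraction of pairs to simulate label noise.

    `pairs` is a list of (y_w, y_l) tuples; `noise_rate` is in [0, 1].
    """
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(pairs))
    flip_idx = set(rng.sample(range(len(pairs)), n_flip))
    return [(l, w) if i in flip_idx else (w, l) for i, (w, l) in enumerate(pairs)]


def cosine_similarity(u, v):
    """Cosine similarity between two response embeddings given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy data: ten preference pairs with a 30% noise rate.
pairs = [(f"chosen_{i}", f"rejected_{i}") for i in range(10)]
noisy_pairs = flip_labels(pairs, noise_rate=0.3)
num_flipped = sum(1 for clean, noisy in zip(pairs, noisy_pairs) if clean != noisy)
```

In this sketch, a reward gap of zero gives a confidence of exactly 0.5, which is the value the captions describe the distribution concentrating around as noise increases. A pair with low cosine similarity would count as "high information" in the sense used in Figure 5, though the embedding model and any selection threshold are not specified here.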