Reward Modeling with Ordinal Feedback: Wisdom of the Crowd

Shang Liu, Yu Pan, Guanting Chen, Xiaocheng Li

TL;DR

A framework for learning RMs under ordinal feedback that generalizes binary preference feedback to arbitrary granularity, with a proof that ordinal feedback reduces the Rademacher complexity relative to binary feedback.

Abstract

Learning a reward model (RM) from human preferences has been an important component in aligning large language models (LLMs). The canonical setup of learning RMs from pairwise preference data is rooted in the classic Bradley-Terry (BT) model that accepts binary feedback, i.e., a label stating either that Response 1 is better than Response 2 or the opposite. Such a setup inevitably discards potentially useful samples (such as a "tie" between the two responses) and loses more fine-grained information (such as "slightly better"). In this paper, we propose a framework for learning RMs under ordinal feedback which generalizes the case of binary preference feedback to arbitrary granularity. Specifically, we first identify a marginal unbiasedness condition, which generalizes the assumption of the BT model in the existing binary feedback setting. The condition validates itself via the sociological concept of the wisdom of the crowd. Under the condition, we develop a natural probability model for pairwise preference data under ordinal feedback and analyze its properties. We prove the statistical benefits of ordinal feedback in terms of reducing the Rademacher complexity compared to the case of binary feedback. The proposed learning objective and the theory also extend to hinge loss and direct preference optimization (DPO). In particular, the theoretical analysis may be of independent interest when applied to a seemingly unrelated problem of knowledge distillation to interpret the bias-variance trade-off therein. The framework also sheds light on writing guidance for human annotators. Our numerical experiments validate that fine-grained feedback leads to better reward learning for both in-distribution and out-of-distribution settings. Further experiments show that incorporating a certain proportion of samples with tied preference boosts RM learning.
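
To make the move from binary to ordinal labels concrete, here is a minimal sketch of a pairwise loss that accepts a soft label. It assumes, and this summary does not confirm, that each feedback level is encoded as a number `z` in [0, 1] (1 meaning Response 1 is clearly better, 0.5 a tie, 0 meaning Response 2 is clearly better) and that the loss is the cross-entropy between `z` and the BT probability `sigmoid(r1 - r2)`. The names `softplus` and `ordinal_bt_loss` are hypothetical.

```python
import math


def softplus(t: float) -> float:
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def ordinal_bt_loss(r1: float, r2: float, z: float) -> float:
    """Cross-entropy between a soft label z and the BT probability sigmoid(r1 - r2).

    Assumption (not taken from the source): z in [0, 1] encodes the ordinal
    feedback level. With z in {0, 1} this is the usual binary BT loss.
    """
    if not 0.0 <= z <= 1.0:
        raise ValueError("z must lie in [0, 1]")
    d = r1 - r2
    # -log(sigmoid(d)) = softplus(-d) and -log(1 - sigmoid(d)) = softplus(d)
    return z * softplus(-d) + (1.0 - z) * softplus(d)


if __name__ == "__main__":
    # Same reward gap, different feedback levels.
    for z in (1.0, 0.75, 0.5):
        print(z, ordinal_bt_loss(r1=1.0, r2=0.0, z=z))
```

Under this encoding a tied sample (`z = 0.5`) still carries a training signal: its loss is smallest when the two rewards are equal, so it does not have to be discarded as in the binary setup.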

Paper Structure

This paper contains 29 sections, 13 theorems, 82 equations, 6 figures, 6 tables, 1 algorithm.

Key Result

Theorem 3.2

For any ordinal feedback set $\mathcal{Z} = \{z_1, \dots, z_m\}$ and any oracle model $z_{\text{oracle}}(x, y_1, y_2)$, one can construct an ordinal feedback $Z$ as a random variable that satisfies the wisdom-of-the-crowd assumption in the following way. Specifically, if $z_{\text{oracle}} \in [z_j, \dots]$ [...] The ordinal feedback $Z$ fulfills the wisdom-of-the-crowd assumption. On the other hand, any ordinal f[...]
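
The theorem statement above is cut off in this summary, so the exact construction is not visible here. The sketch below shows one natural way to build a feedback variable whose mean equals the oracle value, which is what a marginal unbiasedness condition asks for: put all probability on the two levels adjacent to `z_oracle` and weight them by linear interpolation. This two-point form is an assumption, not a quotation of the theorem. The function names and the encoding of levels as sorted numbers in [0, 1] are hypothetical.

```python
import bisect
import random


def ordinal_probs(levels, z_oracle):
    """Return P(Z = level) for each level such that E[Z] = z_oracle.

    Assumption (not confirmed by the truncated source): mass is placed only
    on the two levels adjacent to z_oracle, with linear-interpolation weights.
    `levels` must be sorted in increasing order.
    """
    if not levels[0] <= z_oracle <= levels[-1]:
        raise ValueError("z_oracle must lie within the range of the levels")
    probs = [0.0] * len(levels)
    j = bisect.bisect_left(levels, z_oracle)  # first level >= z_oracle
    if levels[j] == z_oracle:
        probs[j] = 1.0  # oracle value coincides with a level
        return probs
    lo, hi = levels[j - 1], levels[j]
    w_hi = (z_oracle - lo) / (hi - lo)
    probs[j - 1] = 1.0 - w_hi
    probs[j] = w_hi
    return probs


def sample_feedback(levels, z_oracle, rng=random):
    """Draw one annotator label from the distribution above."""
    return rng.choices(levels, weights=ordinal_probs(levels, z_oracle), k=1)[0]


if __name__ == "__main__":
    five_levels = [0.0, 0.25, 0.5, 0.75, 1.0]
    rng = random.Random(0)
    draws = [sample_feedback(five_levels, 0.6, rng) for _ in range(10_000)]
    print(sum(draws) / len(draws))  # empirical mean; should be close to 0.6
```

With only the two levels `[0.0, 1.0]`, the same function reduces to a Bernoulli label with success probability `z_oracle`, which matches the binary feedback setting.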

Figures (6)

  • Figure 1: Wisdom of the crowd. Left: Each individual guess can be far off the target for an ox-weight-guessing social experiment, but the average tends to be very accurate. Each human annotator has no access to the population oracle preference model $z_{\text{oracle}}$, but their annotation constitutes an unbiased realization of $z_{\text{oracle}}$.
  • Figure 2: The evaluation dynamics of llama and gemma models for different ordinal feedback labels.
  • Figure 3: The evaluation dynamics of llama and gemma models for different tied data ratios. The 100%-tied case is omitted because training fails in that setting and plotting it would make the figure harder to read.
  • Figure 4: The evaluation dynamics of llama models for different ordinal feedback labels under generalized hinge loss.
  • Figure 5: Distributions of preference strengths in the two datasets. For the UltraFeedback dataset, we compare the chosen and rejected scores pairwise and use their differences as preference strengths. For the HelpSteer2 dataset, the latest version provides a preference strength label, which we adopt directly.
  • ...and 1 more figure

Theorems & Definitions (27)

  • Definition 2.1: Ordinal Feedback
  • Theorem 3.2
  • Definition 4.1: Feedback Affinity
  • Proposition 4.2
  • Proposition 4.3
  • Definition 4.4: Coupling
  • Definition 4.5: Hierarchical Expectation
  • Proposition 4.6: Existence of Hierarchical Expectation
  • Corollary 4.7
  • Definition 4.8: Rademacher Complexity
  • ...and 17 more