Beyond Binary Preferences: A Principled Framework for Reward Modeling with Ordinal Feedback

Amirhossein Afsharrad, Ruida Zhou, Luca Viano, Sanjay Lall, Mohammad Ghavamzadeh

TL;DR

This work provides the first principled mathematical framework for incorporating Likert scale preferences into reward model training, moving beyond ad-hoc modifications of binary preference models to enable more effective utilization of fine-grained human feedback.

Abstract

Reward modeling is crucial for aligning large language models with human preferences, yet current approaches lack a principled mathematical framework for leveraging ordinal preference data. When human annotators provide graded preferences on a Likert scale (e.g., significantly better, better, slightly better, negligibly better), existing methods typically apply ad-hoc heuristics, such as margin terms or scaling factors, to loss functions derived from binary preference models like Bradley-Terry. These approaches lack an underlying mathematical model for how ordinal preference data is generated. We present a theoretically grounded framework that formulates reward modeling with Likert scale preferences as a discrete ordinal regression problem. We derive two loss functions from this formulation: a negative log-likelihood loss and an all-threshold loss, both of which learn threshold parameters that naturally capture the ordinal structure of preferences. Unlike existing heuristic methods that manually specify fixed margins or scaling weights, our approach learns these parameters directly from data within a coherent probabilistic framework. Experimental results on multiple benchmarks demonstrate that our ordinal regression approach consistently achieves competitive or superior performance compared to existing heuristic methods across diverse evaluation categories including chat, reasoning, and safety tasks. Our work provides the first principled mathematical framework for incorporating Likert scale preferences into reward model training, moving beyond ad-hoc modifications of binary preference models to enable more effective utilization of fine-grained human feedback.
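To make the two losses named in the abstract concrete, here is a minimal NumPy sketch. It uses the textbook ordered-logit likelihood and the standard all-threshold construction with a logistic penalty. Both are assumptions, since this page does not show the paper's exact equations. The names `ordinal_nll` and `all_threshold_loss`, the 0-based labels, and the reading of `s` as a reward difference between two responses are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ordinal_nll(s, z, zeta):
    """Ordered-logit negative log-likelihood (assumed form).

    s:    (n,) scores, e.g. reward differences between two responses
    z:    (n,) integer labels in 0..K-1 (assumed 0-based encoding)
    zeta: (K-1,) increasing thresholds
    """
    # Pad with -inf/+inf so that category k occupies (cuts[k], cuts[k+1]).
    cuts = np.concatenate(([-np.inf], zeta, [np.inf]))
    upper = sigmoid(cuts[z + 1] - s)
    lower = sigmoid(cuts[z] - s)
    return -np.mean(np.log(upper - lower))

def all_threshold_loss(s, z, zeta):
    """All-threshold loss with a logistic penalty (assumed form).

    Every threshold contributes a term: thresholds at or above the true
    category should lie above the score, the others below it.
    """
    k = np.arange(len(zeta))
    sign = np.where(k[None, :] >= z[:, None], 1.0, -1.0)
    margins = sign * (zeta[None, :] - s[:, None])
    # log(1 + exp(-margin)), computed stably.
    return np.mean(np.sum(np.logaddexp(0.0, -margins), axis=1))

# Toy example with K = 3 categories and two thresholds.
s = np.array([-2.0, 0.1, 3.0])
zeta = np.array([-1.0, 1.0])
z = np.array([0, 1, 2])
print(ordinal_nll(s, z, zeta), all_threshold_loss(s, z, zeta))
```

In a training setup the thresholds `zeta` would be learned jointly with the reward model rather than fixed as they are in this toy example.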

Paper Structure

This paper contains 100 sections, 2 theorems, 50 equations, 11 figures, 6 tables, 3 algorithms.

Key Result

Theorem 3.1

Consider either the NLL loss $\mathcal{L}_{\text{NLL}}$ or the AT loss $\mathcal{L}_{\text{AT}}$ without regularization. Suppose there exists a solution $(r^*_\phi, \zeta^*)$ such that all training examples are correctly ordered, i.e., $s^*_i \in (\zeta^*_{z_i-1}, \zeta^*_{z_i})$ for all $(x_i, y_i,

Figures (11)

  • Figure 1: Evolution of symmetric threshold parameters during training with different L2 regularization weights. The plots show zeta_0 through zeta_5, which correspond to the ordinal thresholds $\zeta_{-3}$ through $\zeta_{3}$ in our formulation (excluding $\zeta_0$, which is not parameterized). Due to symmetry constraints, zeta_0 = $-\zeta_3$, zeta_1 = $-\zeta_2$, and zeta_2 = $-\zeta_1$ in the plots. Without regularization ($\lambda=0$, red), thresholds exhibit unbounded growth. Moderate regularization ($\lambda=0.1$, blue) allows gradual convergence with more flexibility, while stronger regularization ($\lambda=1$, green) enforces rapid convergence to stable values.
  • Figure 2: Multi-phase threshold evolution during training with moderate regularization ($\lambda=0.1$). The thresholds exhibit step-like progression with distinct stable phases separated by rapid transition periods. Outer thresholds (zeta_0 and zeta_5) show the most pronounced phase transitions.
  • Figure 3: Training reward values for chosen and rejected responses across different regularization settings. Without regularization ($\lambda=0$, red), rewards diverge significantly with chosen responses receiving increasingly large positive rewards and rejected ones receiving increasingly negative rewards. With regularization ($\lambda=0.1$ in blue, $\lambda=1$ in green), reward values remain bounded and stable throughout training.
  • Figure 4: Evaluation reward values for chosen (left) and rejected (right) responses across different regularization settings. Without regularization (red), both chosen and rejected rewards diverge to extreme values. With regularization (blue and green), both reward types remain bounded and stable.
  • Figure 5: Ordinal performance metrics on the training set. MAE decreases rapidly in early training, stabilizing around 0.2. Exact accuracy reaches approximately 85%, while near-perfect accuracy (Within 1) approaches 95%, indicating strong ordinal prediction on training data.
  • ...and 6 more figures
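
The ordinal metrics named in the Figure 5 caption (MAE, exact accuracy, and "Within 1" accuracy) can be computed as in the sketch below. The prediction rule, which assigns each score to the interval between the two thresholds that bracket it, is an assumption, as are the function name `ordinal_metrics` and the 0-based labels. The caption does not specify how predictions are derived.

```python
import numpy as np

def ordinal_metrics(s, z, zeta):
    """Return (MAE, exact accuracy, within-1 accuracy) for ordinal predictions.

    s:    (n,) scores
    z:    (n,) integer labels in 0..K-1 (assumed 0-based encoding)
    zeta: (K-1,) increasing thresholds
    """
    # Predicted category = number of thresholds lying below the score.
    pred = np.searchsorted(zeta, s)
    err = np.abs(pred - z)
    return err.mean(), (err == 0).mean(), (err <= 1).mean()

# Toy example: the last score falls one category below its label.
s = np.array([-2.0, 0.1, 3.0, 0.5])
z = np.array([0, 1, 2, 2])
zeta = np.array([-1.0, 1.0])
mae, exact, within1 = ordinal_metrics(s, z, zeta)
print(mae, exact, within1)  # 0.25 0.75 1.0
```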

Theorems & Definitions (2)

  • Theorem 3.1: Unbounded Solution Set
  • Theorem 3.2: Threshold Symmetry under Ordered Logit
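
The statement of Theorem 3.1 is truncated on this page, so its exact conclusion is not reproduced here. The numerical sketch below only illustrates one behaviour consistent with the theorem's title and with the unbounded threshold growth reported for $\lambda=0$ in Figure 1. It assumes the textbook ordered-logit NLL, with hypothetical names and toy data. When every example is correctly ordered, scaling scores and thresholds together by a factor $c > 1$ keeps the ordering intact and lowers the unregularized loss.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ordered_logit_nll(s, z, zeta):
    # Assumed textbook form: P(z = k) = sigmoid(zeta_k - s) - sigmoid(zeta_{k-1} - s).
    cuts = np.concatenate(([-np.inf], zeta, [np.inf]))
    return -np.mean(np.log(sigmoid(cuts[z + 1] - s) - sigmoid(cuts[z] - s)))

# Toy data in which every score lies inside the interval of its own label.
s = np.array([-2.0, 0.1, 3.0])
z = np.array([0, 1, 2])
zeta = np.array([-1.0, 1.0])

# Scale scores and thresholds jointly; the ordering is preserved for any c > 0.
scales = (1.0, 2.0, 4.0, 8.0)
losses = [ordered_logit_nll(c * s, z, c * zeta) for c in scales]
for c, loss in zip(scales, losses):
    print(f"c = {c:>3}: NLL = {loss:.6f}")
```

The printed losses shrink as `c` grows, so in this toy setting no finite parameter scale minimizes the unregularized loss. A penalty on parameter magnitude, such as the L2 regularization discussed in the figure captions, is one way to remove that scale freedom.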