The Limits of Preference Data for Post-Training

Eric Zhao, Jessica Dai, Pranjal Awasthi

TL;DR

The paper shows that post-training language models on ordinal preference data alone faces fundamental limits, even when the data is infinite, noiseless, and online. By formalizing post-training as routing queries to a fixed set of pretrained circuits and mapping this onto the distortion framework from social choice theory, the authors derive lower bounds on how close any preference-based method can get to the optimum, with distortion growing with model complexity. The paper further argues that reasoning tasks are particularly affected because ordinal preferences penalize robust strategies (e.g., backtracking), which explains why RLHF has struggled to improve reasoning compared with instruction-tuning and safety training. It suggests practical mitigations, such as incorporating cardinal feedback or task-specific labeling cues, and calls for grounded scoring and algorithmic innovations to extend RL-based post-training to domains that depend on human feedback.

Abstract

Recent progress in strengthening the capabilities of large language models has stemmed from applying reinforcement learning to domains with automatically verifiable outcomes. A key question is whether we can similarly use RL to optimize for outcomes in domains where evaluating outcomes inherently requires human feedback; for example, in tasks like deep research and trip planning, outcome evaluation is qualitative and there are many possible degrees of success. One attractive and scalable modality for collecting human feedback is preference data: ordinal rankings (pairwise or $k$-wise) that indicate, for $k$ given outcomes, which one is preferred. In this work, we study a critical roadblock: preference data fundamentally and significantly limits outcome-based optimization. Even with idealized preference data (infinite, noiseless, and online), the use of ordinal feedback can prevent obtaining even approximately optimal solutions. We formalize this impossibility using voting theory, drawing an analogy between how a model chooses to answer a query and how voters choose a candidate to elect. This indicates that grounded human scoring and algorithmic innovations are necessary for extending the success of RL post-training to domains demanding human feedback. We also explore why these limitations have disproportionately impacted RLHF when it comes to eliciting reasoning behaviors (e.g., backtracking) versus situations where RLHF has been historically successful (e.g., instruction-tuning and safety training), finding that the limitations of preference data primarily suppress RLHF's ability to elicit robust strategies -- a class that encompasses most reasoning behaviors.
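The voting analogy in the abstract can be made concrete with a toy example. The sketch below is not the paper's construction. It assumes two hypothetical response strategies, "A" and "B", ten queries acting as voters, and made-up utility values, and it aggregates rankings with the standard Borda count. It shows how ordinal preferences can select a strategy whose average utility is far from the best available one.

```python
# Toy illustration of ordinal "distortion". All names and numbers are
# hypothetical and are not taken from the paper.

def ranking_from_utilities(u):
    """Turn one query's cardinal utilities into an ordinal ranking (best first)."""
    return sorted(u, key=lambda c: u[c], reverse=True)

def borda_scores(rankings, candidates):
    """Standard Borda count: with m candidates, position i (0 = best) earns m - 1 - i points."""
    m = len(candidates)
    scores = {c: 0 for c in candidates}
    for ranking in rankings:
        for position, c in enumerate(ranking):
            scores[c] += m - 1 - position
    return scores

def mean_utility(utilities, c):
    """Average cardinal utility of always answering with strategy c."""
    return sum(u[c] for u in utilities) / len(utilities)

candidates = ["A", "B"]

# Six queries barely prefer A; four queries strongly prefer B.
utilities = [{"A": 0.51, "B": 0.50}] * 6 + [{"A": 0.0, "B": 1.0}] * 4

# Preference data only reveals the rankings, never the utility gaps.
rankings = [ranking_from_utilities(u) for u in utilities]
scores = borda_scores(rankings, candidates)  # {'A': 6, 'B': 4}
ordinal_choice = max(candidates, key=lambda c: scores[c])

# The choice that actually maximizes average utility.
best = max(candidates, key=lambda c: mean_utility(utilities, c))

# Distortion: best achievable average utility over what the ordinal rule gets.
distortion = mean_utility(utilities, best) / mean_utility(utilities, ordinal_choice)

print(ordinal_choice, best, round(distortion, 2))
```

In this example the ordinal rule picks "A" because it wins on more queries, while "B" has the higher average utility (0.7 versus about 0.31), a distortion of roughly 2.29. Any other utility profile with the same rankings would produce the same Borda scores, so ranking data alone cannot tell such cases apart.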

Paper Structure

This paper contains 42 sections, 9 theorems, 32 equations, 6 figures, 2 tables.

Key Result

Theorem 3.3

Consider any pretrained model $M_0 = (\phi_0, g_0, \mathcal{S}_0)$ and post-training algorithm $\mathcal{A}$. There always exists a post-training objective, i.e., a utility $u: \mathcal{Q} \times \mathcal{R} \to \mathbb{R}$ we wish to maximize, such that: if we post-train $M_0$ on noiseless preference [...] when $|\mathcal{Q}| \gg |\mathcal{S}|, |\mathcal{Z}|$. Moreover, this lower bound holds even if we [...]

Figures (6)

  • Figure 1.1: Overview of our motivation and results. Left: RLHF and RLVR have demonstrated empirical success on alignment and close-ended reasoning tasks, respectively. Our investigation of preference data is motivated by reasoning tasks that require human feedback. Right: Comparison of scalar rewards vs. ordinal preferences as data modalities. Our impossibility result is due to a connection between the post-training of models using preference data and the analysis of electoral systems in social choice theory.
  • Figure 4.1: Robustness as a reasoning strategy.
  • Figure 4.2: Non-reasoning models contain robust behaviors.
  • Figure 4.3: Comparison of LM (Gemini 2.0 Pro) preferences for succinct, non-backtracking responses versus lengthy, backtracking responses, when both final answers are correct.
  • Figure C.1: Deepseek R1's response to a LiveBench Reasoning question; the model double-checks a stated fact.
  • ...and 1 more figure

Theorems & Definitions (17)

  • Definition 3.1: Borda count
  • Example 3.2
  • Theorem 3.3
  • Theorem 3.4
  • Theorem 3.5: Corollary 2 of amanatidis2021peeking
  • Theorem A.1
  • Theorem A.1: Generalization of the bounded-computation theorem (label: theorem:bounded_computation_formal)
  • proof
  • Lemma A.1
  • Lemma A.1
  • ...and 7 more