Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin

TL;DR

Rubric-based rewards are shown empirically to substantially mitigate reward over-optimization and to deliver effective LLM post-training improvements.

Abstract

Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signal to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g., from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements.

Paper Structure

This paper contains 36 sections, 3 theorems, 7 equations, 6 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

Define $R^x_0=r^\star(x,Y_0)$ with $Y_0\sim\pi_0(\cdot\mid x)$, and let $F^x_0$ be its cumulative distribution function. The RFT solution (eq:rlhf-sol) has:

Figures (6)

  • Figure 1: Chasing the Tail with Rubric-Based Rewards
  • Figure 2: Theoretical impact of reward model misspecification on performance. (a) Inaccuracy in the high-value region causes performance to collapse. (b) Correctly ranking top responses is sufficient for near-optimal performance.
  • Figure 3: Rubric refinement through response differentiation. (a) Single-round: A proposer LLM analyzes a pair of responses to identify distinguishing features and encodes them as new rubric criteria. (b) Iterative: Multiple rounds progressively focus on higher-quality responses, with each iteration filtering to top-scoring candidates before generating new differentiating rubrics.
  • Figure 4: Model performance metrics across training steps for different rubric construction strategies. The panels show the evolution of the Win Rate and the respective benchmark scores in the healthcare and finance domains.
  • Figure 5: The Distribution of Rubrics Targeting Each Type of Model Capability
  • ...and 1 more figure
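
The iterative loop described in the Figure 3 caption (filter to top-scoring candidates, then have a proposer differentiate a pair of them) can be illustrated with a minimal Python sketch. This is not the paper's implementation. The names `score`, `refine_rubrics`, `toy_judge`, `toy_propose`, the `keep_fraction` parameter, the choice to compare the two top-ranked responses, and the unweighted fraction-of-criteria score are all assumptions made for illustration. In practice the judge and the proposer would be LLM calls; here they are keyword stand-ins so the example runs on its own.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   judge(response, criterion) -> True if the response satisfies the criterion
#   propose(a, b)              -> new criteria that distinguish responses a and b
Judge = Callable[[str, str], bool]
Proposer = Callable[[str, str], List[str]]


def score(response: str, rubrics: Sequence[str], judge: Judge) -> float:
    """Fraction of rubric criteria the response satisfies (unweighted, an assumption)."""
    if not rubrics:
        return 0.0
    return sum(judge(response, c) for c in rubrics) / len(rubrics)


def refine_rubrics(
    responses: Sequence[str],
    rubrics: Sequence[str],
    judge: Judge,
    propose: Proposer,
    rounds: int = 2,
    keep_fraction: float = 0.5,
) -> Tuple[List[str], List[str]]:
    """Each round: keep the top-scoring responses, then add criteria that
    differentiate the two best remaining ones."""
    rubrics = list(rubrics)
    pool = list(responses)
    for _ in range(rounds):
        if len(pool) < 2:
            break
        # Rank by the current rubric score; Python's sort is stable, so ties keep input order.
        pool.sort(key=lambda r: score(r, rubrics, judge), reverse=True)
        # Filter to the top-scoring candidates, never fewer than two.
        pool = pool[: max(2, int(len(pool) * keep_fraction))]
        # Ask the proposer what separates the two best responses.
        for criterion in propose(pool[0], pool[1]):
            if criterion not in rubrics:
                rubrics.append(criterion)
    return rubrics, pool


def toy_judge(response: str, criterion: str) -> bool:
    """Stand-in for an LLM judge: the criterion is a keyword that must appear."""
    return criterion in response.split()


def toy_propose(a: str, b: str) -> List[str]:
    """Stand-in for an LLM proposer: words present in exactly one of the two responses."""
    return sorted(set(a.split()) ^ set(b.split()))


if __name__ == "__main__":
    candidates = ["rest", "rest fluids", "rest fluids doctor", "rest fluids doctor dosage"]
    final_rubrics, final_pool = refine_rubrics(
        candidates, ["rest"], toy_judge, toy_propose, rounds=2, keep_fraction=0.75
    )
    print(final_rubrics)
    print(final_pool)
```

With the toy stand-ins, ties under the initial rubric are broken by input order, so which responses survive the first filter depends on how the candidates are listed. A real workflow would need a more deliberate pairing and tie-breaking rule.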

Theorems & Definitions (4)

  • Proposition 1
  • Theorem 1
  • Theorem 1
  • proof