Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks

William F. Shen; Xinchi Qiu; Chenxi Whitehouse; Lisa Alazraki; Shashwat Goel; Francesco Barbieri; Timon Willi; Akhil Mathur; Ilias Leontiadis

TL;DR

RRD (Recursive Rubric Decomposition) is a principled, recursive rubric refinement framework that improves both LLM judging and reward modeling for open-ended tasks. By decomposing broad criteria into fine-grained, discriminative rubrics, filtering misaligned and redundant signals, and applying correlation-aware (whitened) weighting, RRD achieves higher judge accuracy and more stable, higher-quality rewards for reinforcement fine-tuning (RFT). Empirical results show large gains on JudgeBench and PPE for both GPT-4o and Llama3.1-405B judges, along with substantially stronger RFT reward signals and improved downstream policy performance on BiGGen Bench and HealthBench-Hard, with gains transferring to high-stakes domains. Overall, recursive rubric refinement provides a scalable, interpretable foundation for aligning LLMs in open-ended evaluation and generation tasks.

Abstract

Recently, rubrics have been used to guide LLM judges in capturing subjective, nuanced, multi-dimensional human preferences, and have been extended from evaluation to reward signals for reinforcement fine-tuning (RFT). However, rubric generation remains hard to control: rubrics often lack coverage, conflate dimensions, misalign preference direction, and contain redundant or highly correlated criteria, degrading judge accuracy and producing suboptimal rewards during RFT. We propose RRD, a principled framework for rubric refinement built on a recursive decompose-filter cycle. RRD decomposes coarse rubrics into fine-grained, discriminative criteria, expanding coverage while sharpening separation between responses. A complementary filtering mechanism removes misaligned and redundant rubrics, and a correlation-aware weighting scheme prevents over-representing highly correlated criteria, yielding rubric sets that are informative, comprehensive, and non-redundant. Empirically, RRD delivers large, consistent gains across both evaluation and training: it improves preference-judgment accuracy on JudgeBench and PPE for both GPT-4o and Llama3.1-405B judges, achieving top performance in all settings with up to +17.7 points on JudgeBench. When used as the reward source for RFT on WildChat, it yields substantially stronger and more stable learning signals, boosting reward by up to 160% (Qwen3-4B) and 60% (Llama3.1-8B) versus 10-20% for prior rubric baselines, with gains that transfer to HealthBench-Hard and BiGGen Bench. Overall, RRD establishes recursive rubric refinement as a scalable and interpretable foundation for LLM judging and reward modeling in open-ended domains.
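The decompose-filter cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `refine_rubrics` and the callables `is_coarse`, `decompose`, `is_misaligned`, and `is_redundant` are hypothetical stand-ins for what would be LLM calls in practice, and replacing a coarse rubric with its children is an assumption. The stopping rule (halt once more than `max_discarded` rubrics have been discarded) follows the description in the Figure 2 caption.

```python
from typing import Callable, List


def refine_rubrics(
    initial: List[str],
    is_coarse: Callable[[str], bool],
    decompose: Callable[[str], List[str]],
    is_misaligned: Callable[[str], bool],
    is_redundant: Callable[[str, List[str]], bool],
    max_discarded: int,
) -> List[str]:
    """Hypothetical decompose-filter loop over a queue of candidate rubrics."""
    accepted: List[str] = []
    queue = list(initial)
    discarded = 0
    # Stop when the queue is empty or the discard count exceeds the budget.
    while queue and discarded <= max_discarded:
        rubric = queue.pop(0)
        # Filter step: drop misaligned or redundant candidates.
        if is_misaligned(rubric) or is_redundant(rubric, accepted):
            discarded += 1
            continue
        # Decompose step: coarse rubrics are replaced by finer children
        # (an assumption of this sketch), which are filtered in turn.
        if is_coarse(rubric):
            queue.extend(decompose(rubric))
        else:
            accepted.append(rubric)
    return accepted


# Toy stand-ins: a rubric joined by " and " counts as coarse.
refined = refine_rubrics(
    initial=["accurate and complete", "longer is better", "accurate"],
    is_coarse=lambda r: " and " in r,
    decompose=lambda r: r.split(" and "),
    is_misaligned=lambda r: r.startswith("longer"),
    is_redundant=lambda r, kept: r in kept,
    max_discarded=5,
)
print(refined)
```

In this toy run the length-preferring rubric is filtered as misaligned, the coarse rubric is split in two, and the second copy of "accurate" is filtered as redundant.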

Paper Structure (36 sections, 3 theorems, 32 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
RRD Framework
Rubric-based Judge Overview
Theoretical Grounding for Rubric Quality
Methodology: Recursive Rubric Decomposition (RRD)
RRD-based LLM Judge Results
Dataset.
Baselines.
Results
Ablations.
RRD-based RFT
Dataset and Training.
Models
Results
Reward Dynamics during Training.
...and 21 more sections

Key Result

Lemma 1

For any $w\neq 0$,

Figures (5)

Figure 1: RRD consistently outperforms all baselines on both JudgeBench and PPE for both proprietary (GPT-4o) and open-weights (Llama3.1-405B) judges, delivering substantial gains in preference-judgment accuracy.
Figure 2: Overview of the RRD framework, which consists of three stages. (I) Initial Rubric Proposal: an LLM proposes initial candidate rubrics, conditioned on the task prompt and sample responses, for optimization. (II) Recursive Decomposition and Filtering: coarse rubrics are recursively decomposed into finer dimensions to enhance coverage and discrimination, while misaligned and redundant rubrics are filtered out. The cycle stops when the number of discarded rubrics exceeds $N$, indicating saturation in novel, non-redundant, and valid rubrics. (III) Rubric Weight Assignment: for open-ended tasks where the preference signal is distributed, whitened uniform (WU) weights are assigned to account for the correlation structure and prevent over-representation of highly correlated rubrics; otherwise, LLM-proposed heuristic weights are assigned. Empirically, WU weighting yields higher LLM judge accuracy and improves the effectiveness of rubrics as generative rewards in RFT.
Figure 3: (a) Accuracy on JudgeBench and PPE Preference datasets for base model and rubric-assisted judges under different rubric-generation strategies. While basic LLM-generated rubrics (unconditioned on sample responses) can degrade performance, RRD yields consistent improvements over the baselines. Notably, $\text{RRD}_{\text{WU}}$ delivers the largest gains and scales reliably across both proprietary (GPT-4o) and open-weights (Llama3.1-405B) judges. (b) Rubric-count dynamics on JudgeBench. Starting from an average of $7.4$ rubrics, the count rises to $\sim 20$, while the increasing variance across tasks indicates that the recursive procedure adapts evaluation depth to instance complexity.
Figure 4: Reward improvement during training of Qwen3-4B (left) and Llama3.1-8B-instruct (right) models using various rubric generation methods. Both $\text{RRD}_{\text{WU}}$ and $\text{RRD}_{\text{LLM}}$ provide a significantly stronger reward signal than traditional rubric-based or iterative baselines. $\text{RRD}_{\text{WU}}$, in particular, shows superior training stability and higher cumulative reward gains across both architectures, indicating a more robust and granular supervision signal for RFT.
Figure 5: Multi-dimensional evaluation (scores in percentage) on BiGGen Bench (left) and HealthBench-Hard (right). Comparison of $\text{RRD}_{\text{WU}}$ against five baseline methods using Llama-3.1 and Qwen3 base models. $\text{RRD}_{\text{WU}}$ (solid red) demonstrates robust improvements across all axes, particularly in Instruction Following (IF) and Completeness.
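
The Figure 2 caption names whitened uniform (WU) weighting but does not give its formula here. The Python sketch below shows one plausible reading, stated as an assumption rather than the paper's definition: apply uniform weights in a whitened rubric space, which corresponds to $w \propto \Sigma^{-1/2}\mathbf{1}$, where $\Sigma$ is the covariance of rubric scores across responses. The function name `whitened_uniform_weights`, the ridge term `eps`, and the normalization to sum to one are all illustrative choices.

```python
import numpy as np


def whitened_uniform_weights(scores: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Assumed WU weights from a (n_responses, n_rubrics) score matrix.

    Requires at least two rubrics. Computes w = Sigma^{-1/2} @ 1,
    normalized to sum to one.
    """
    k = scores.shape[1]
    # Rubric covariance with a small ridge so every eigenvalue is positive.
    cov = np.cov(scores, rowvar=False) + eps * np.eye(k)
    # Symmetric inverse square root via eigendecomposition.
    vals, vecs = np.linalg.eigh(cov)
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    w = inv_sqrt @ np.ones(k)
    return w / w.sum()


# Toy scores: rubrics 0 and 1 are exact duplicates, rubric 2 is uncorrelated.
a = np.array([0.0, 1.0, 0.0, 1.0])
b = np.array([0.0, 0.0, 1.0, 1.0])
weights = whitened_uniform_weights(np.column_stack([a, a, b]))
print(weights)
```

Under this assumption, each of the two duplicated rubrics receives less weight than the uncorrelated one, which is the behaviour the caption attributes to WU weighting: highly correlated rubrics are not over-represented.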

Theorems & Definitions (6)

Lemma 1
proof
Lemma 2
proof
Theorem 1
proof
