RubricBench: Aligning Model-Generated Rubrics with Human Standards

Qiyuan Zhang, Junyi Zhou, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, Chen Ma

TL;DR

RubricBench is a curated benchmark of 1,147 pairwise comparisons designed to assess the reliability of rubric-based evaluation. Experiments on it indicate that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.

Abstract

As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.
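
To make the evaluation setting concrete, the following is a minimal sketch of rubric-guided pairwise judging, not the authors' implementation. It assumes each example carries an instruction, two responses, a list of atomic rubric items, and a human preference label. The names `PairwiseExample`, `rubric_score`, `judge`, `pairwise_accuracy`, and the toy `keyword_check` are hypothetical. In practice the checker would be an LLM judge deciding whether a response satisfies a given criterion.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PairwiseExample:
    """One pairwise comparison (hypothetical schema, not the benchmark's)."""
    instruction: str
    response_a: str
    response_b: str
    rubric: Sequence[str]  # atomic criteria derived from the instruction
    preferred: str         # human label: "A" or "B"


# A checker answers: does `response` satisfy `criterion` for `instruction`?
Checker = Callable[[str, str, str], bool]


def rubric_score(ex: PairwiseExample, response: str, check: Checker) -> float:
    """Fraction of rubric items the response satisfies."""
    if not ex.rubric:
        return 0.0
    passed = sum(check(ex.instruction, response, c) for c in ex.rubric)
    return passed / len(ex.rubric)


def judge(ex: PairwiseExample, check: Checker) -> str:
    """Prefer the response with the higher rubric score; report ties."""
    a = rubric_score(ex, ex.response_a, check)
    b = rubric_score(ex, ex.response_b, check)
    if a == b:
        return "tie"
    return "A" if a > b else "B"


def pairwise_accuracy(examples: Sequence[PairwiseExample], check: Checker) -> float:
    """Share of examples where the rubric-based verdict matches the label."""
    if not examples:
        return 0.0
    correct = sum(judge(ex, check) == ex.preferred for ex in examples)
    return correct / len(examples)


def keyword_check(instruction: str, response: str, criterion: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return criterion.lower() in response.lower()
```

Under this sketch, comparing human-annotated and model-generated rubrics amounts to calling `pairwise_accuracy` twice on the same pairs, swapping only the `rubric` field. Ties are counted as errors here, which is a design choice of the sketch rather than something the abstract specifies.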

Paper Structure (48 sections, 6 equations, 3 figures, 12 tables)

Figures (3)

  • Figure 1: Overview of RubricBench construction and evaluation setting. Starting from existing preference data, we curate challenging preference pairs via multi-dimensional filtering and annotate them with instruction-only human rubrics through a three-stage pipeline with quality control.
  • Figure 2: RubricBench statistics overview. (a) Domain and source composition of RubricBench. (b) Distribution of rubric items per example and text lengths of instructions, responses, and rubrics.
  • Figure 3: Test-time scaling results. (a) Scaling auto-rubrics. (b) Scaling human rubrics. (c) Scaling refinement. All three panels vary test-time compute.