Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

Ruipeng Jia; Yunyi Yang; Yuxin Wu; Yongbo Gai; Siyuan Tao; Mengyu Zhou; Jianhe Lin; Xiaoxi Jiang; Guanjun Jiang

TL;DR

This work presents the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs). The PVRs act both as guardrails against degenerate behaviors and as a source of verifiable reward for objective sub-tasks.

Abstract

Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. We argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which provide both hard-constraint guardrails and verifiable reward components when ground-truth or programmatic checks are available. OpenRS uses an explicit meta-rubric (a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced) and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates the criterion-level preferences externally, which avoids pointwise weighted scalarization and improves discriminability in open-ended settings. To keep principles consistent yet editable across domains, we introduce a two-level meta-rubric refinement pipeline: automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles. The PVRs complement this pipeline by guarding against degenerate behaviors and supplying verifiable reward for objective sub-tasks. Finally, we instantiate OpenRS as reward supervision in pairwise RL training.
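The judging flow described in the abstract (verifiable checks as guardrails, then criterion-wise pairwise comparison with external aggregation) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `CriterionVerdict` type, the weighted-vote rule in `aggregate`, and the guardrail policy in `judge_pair` are assumptions chosen for illustration. In a real system the per-criterion preferences would come from an LLM judge; here they are supplied by hand.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical preference labels. "Same" mirrors the 'Same' judgments
# mentioned in the caption of Figure 4.
A, B, SAME = "A", "B", "Same"


@dataclass
class CriterionVerdict:
    criterion: str    # rubric item instantiated for this response pair
    weight: float     # importance of the criterion (assumed to be given)
    preference: str   # A, B, or SAME, as decided by the judge


def aggregate(verdicts: List[CriterionVerdict], margin: float = 0.0) -> str:
    """Combine criterion-level preferences outside the judge.

    Assumption: a signed weighted vote. The paper's actual aggregation
    rule may differ.
    """
    score = 0.0
    for v in verdicts:
        if v.preference == A:
            score += v.weight
        elif v.preference == B:
            score -= v.weight
        # SAME contributes nothing to either side.
    if score > margin:
        return A
    if score < -margin:
        return B
    return SAME


def judge_pair(
    resp_a: str,
    resp_b: str,
    verdicts: List[CriterionVerdict],
    checks: List[Callable[[str], bool]],
) -> str:
    """Apply pointwise verifiable checks first, then aggregate verdicts."""
    ok_a = all(check(resp_a) for check in checks)
    ok_b = all(check(resp_b) for check in checks)
    # If exactly one response passes every check, it wins outright.
    if ok_a and not ok_b:
        return A
    if ok_b and not ok_a:
        return B
    # Both pass or both fail: fall back to the rubric comparison.
    return aggregate(verdicts)


def non_empty(response: str) -> bool:
    """Toy programmatic check: the response must contain some text."""
    return bool(response.strip())


verdicts = [
    CriterionVerdict("factual accuracy", 2.0, A),
    CriterionVerdict("style", 1.0, B),
    CriterionVerdict("depth", 1.0, SAME),
]
print(aggregate(verdicts))
print(judge_pair("a real answer", "   ", verdicts, [non_empty]))
```

Keeping the aggregation outside the judge means the combination rule can be inspected and changed without re-prompting or retraining the judge, which is the property the abstract emphasizes.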

Paper Structure (42 sections, 6 equations, 14 figures, 6 tables, 1 algorithm)

Introduction
Preliminary
Group Relative Policy Optimization
Pairwise Evaluation and Bootstrapped Relative Policy Optimization
Approach
Overall Framework
Pairwise Adaptive Rubric
Hierarchical Meta Rubric
Evolutionary Rubric Refinement
Pointwise Verifiable Rubric
General Meta Rubric Refinement
Domain Meta Rubric Refinement
Applying OpenRS to Reinforcement Learning
Experiments
Experimental Setup
...and 27 more sections

Figures (14)

Figure 1: Overall Framework of OpenRS
Figure 2: Pareto Frontier for Open Rubric System on Different Benchmarks.
Figure 3: Training dynamics of the refinement policy $\pi_{\text{refine}}$ under different settings.
Figure 4: Evolution of metrics during RL training: (a) Policy Entropy; (b) The ratio of 'Same' judgments by the pairwise adaptive rubric.
Figure 5: General Meta Rubric, Chinese Version
...and 9 more figures
