Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning

Zhiling Ye; Yun Yue; Haowen Wang; Xudong Han; Jiadi Jiang; Cheng Wei; Lei Fan; Jiaxin Liang; Shuowen Zhang; Ji Li; Chunxiao Guo; Jian Wang; Peng Wei; Jinjie Gu

TL;DR

The paper addresses open-ended evaluation for health-domain LLMs with Self-Rewarding Rubric-Based Reinforcement Learning (SR-RBRL), a framework in which the policy grades its own responses against rubrics. This removes the need for a separate reward model and makes training more resource-efficient. On HealthBench, training with rubric-based self-scores on the HealthBench Easy set yields a model that can surpass GPT-5 on HealthBench Hard, and teacher data brings additional gains for weaker models. The reported results also show training-time reductions of roughly 50% and an effect of dataset composition: for strong models, expert rubric data outperform both synthetic signals and mixed data. The work offers a practical pathway to scalable, trustworthy open-ended reasoning in healthcare and points to broad-domain rubric generation and evaluation signals as future opportunities.

Abstract

Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader to generate rubric-based reward signals substantially improves reasoning performance. Notably, the trained model also becomes a stronger grader. Motivated by this, we introduce Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, a lightweight framework that enables faster and more resource-efficient training while surpassing baselines. On Qwen3-32B, training on only the 4000-sample HealthBench Easy subset is sufficient to obtain a model that exceeds GPT-5 on HealthBench Hard. Incorporating a small amount of teacher-graded data further enhances performance for less capable models.
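To make the idea of a rubric-based reward signal concrete, here is a minimal sketch of how per-criterion judgments might be collapsed into one scalar reward. It is not the authors' implementation. The aggregation rule (points earned divided by the total positive points, clipped to [0, 1]) is an assumption, and `RubricItem`, `rubric_reward`, and `keyword_grader` are hypothetical names. In a self-rewarding setup the grader callable would be the policy model itself, prompted to judge each criterion; a keyword match stands in for it here so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One rubric criterion (hypothetical structure, for illustration only)."""
    criterion: str
    points: float  # positive for desirable behavior, negative for undesirable


def rubric_reward(
    response: str,
    rubric: List[RubricItem],
    grader: Callable[[str, str], bool],
) -> float:
    """Aggregate per-criterion judgments into a reward in [0, 1].

    Assumed rule: sum the points of every criterion the grader marks as met,
    divide by the total of the positive points, then clip to [0, 1].
    """
    max_points = sum(item.points for item in rubric if item.points > 0)
    if max_points <= 0:
        return 0.0
    earned = sum(
        item.points for item in rubric if grader(response, item.criterion)
    )
    return min(1.0, max(0.0, earned / max_points))


def keyword_grader(response: str, criterion: str) -> bool:
    """Stand-in grader: a criterion is 'met' if its text appears in the response.

    In a self-rewarding loop this would instead query the policy model.
    """
    return criterion.lower() in response.lower()


rubric = [
    RubricItem("see a doctor", 5),
    RubricItem("dosage", 3),
    RubricItem("guaranteed cure", -4),
]
full = rubric_reward("Please see a doctor about the dosage.", rubric, keyword_grader)
partial = rubric_reward("You should see a doctor.", rubric, keyword_grader)
harmful = rubric_reward("This is a guaranteed cure.", rubric, keyword_grader)
```

Under these assumptions, a response that meets both positive criteria scores the maximum, one that meets only the first scores 5/8, and one that triggers only the negative criterion is clipped to zero.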
