Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning

Zhiling Ye, Yun Yue, Haowen Wang, Xudong Han, Jiadi Jiang, Cheng Wei, Lei Fan, Jiaxin Liang, Shuowen Zhang, Ji Li, Chunxiao Guo, Jian Wang, Peng Wei, Jinjie Gu

TL;DR

The paper addresses open-ended evaluation for health-domain LLMs by introducing Self-Rewarding Rubric-Based Reinforcement Learning (SR-RBRL), a framework in which the policy model grades its own responses against task-specific rubrics. This removes the need for a separate reward model and makes training more resource-efficient. Using HealthBench, the authors show that training with rubric-based self-scores on the HealthBench Easy set can surpass GPT-5 on HealthBench Hard, and that adding teacher data yields further gains for weaker models. The reported results also include a training-time reduction of roughly 50% and an effect of dataset composition: for strong models, expert rubric data outperform both synthetic signals and mixed data. The authors point to broad-domain rubric generation and evaluation signals as directions for future work.

Abstract

Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader and generating rubric-based reward signals substantially improves reasoning performance. Remarkably, the trained model also becomes a stronger grader. Motivated by this, we introduce Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, a lightweight framework that enables faster and more resource-efficient training while surpassing baselines. Notably, on Qwen3-32B, training with just the 4000-sample HealthBench Easy subset is sufficient to obtain a model that exceeds GPT-5 on HealthBench Hard. Incorporating a small amount of teacher-graded data further enhances performance for less capable models.
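
To make the idea of a rubric-based reward signal concrete, the following is a minimal sketch and not the authors' implementation. It assumes a rubric is a list of criteria with signed point values, that the reward is the share of achievable positive points earned (clipped to [0, 1] per response), and that rewards are standardized within a group of sampled responses in the GRPO style. The names `RubricItem`, `rubric_reward`, `group_advantages`, and `keyword_grader` are hypothetical; in the self-rewarding setting the grader callback would be a call to the policy model itself, which is replaced here by a toy keyword match so the example runs on its own.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class RubricItem:
    criterion: str  # natural-language criterion the response is checked against
    points: float   # positive for desired behavior, negative for undesired


def rubric_reward(response: str, rubric: List[RubricItem],
                  grader: Callable[[str, str], bool]) -> float:
    """Share of achievable positive points earned, clipped to [0, 1] (assumed scheme)."""
    max_points = sum(item.points for item in rubric if item.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(item.points for item in rubric
                 if grader(response, item.criterion))
    return min(1.0, max(0.0, earned / max_points))


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keyword_grader(response: str, criterion: str) -> bool:
    # Toy stand-in for the policy-as-grader call (an assumption for this sketch):
    # a criterion counts as met if its text appears in the response.
    return criterion.lower() in response.lower()


# Hypothetical rubric and sampled responses for a single prompt.
rubric = [
    RubricItem("see a doctor", 5),
    RubricItem("dosage", 3),
    RubricItem("guaranteed cure", -4),
]
responses = [
    "Please see a doctor about the dosage.",
    "This is a guaranteed cure.",
    "Please see a doctor.",
]
rewards = [rubric_reward(r, rubric, keyword_grader) for r in responses]
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

In this sketch a response that only triggers a negative criterion is clipped to a reward of zero, and the response that meets both positive criteria receives the highest advantage within its group. Whether clipping is applied per response or only to an aggregate score is a design choice this example assumes rather than takes from the paper.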

Paper Structure

This paper contains 23 sections, 3 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Self-Rewarding Rubric-Based Reinforcement Learning Overview. Unlike the standard GRPO paradigm, the policy model acts as the grader using task-specific rubrics, and the KL penalty is omitted in our experiments.
  • Figure 2: HealthBench Meta score comparison. Reasoning models are shown in semi-transparent colors with hatching patterns.
  • Figure 3: Response length and reward grow as RL training progresses.
  • Figure 4: HealthBench Hard score progression during RL training. The green line is graded by GPT-4.1, as is the red dashed line representing OpenAI o3.
  • Figure 5: Self-rewarding training dynamics and evaluation results.
  • ...and 4 more figures