Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning

Baolong Bi, Shenghua Liu, Yiwei Wang, Siqian Tong, Lingrui Mei, Yuyao Ge, Yilong Xu, Jiafeng Guo, Xueqi Cheng

TL;DR

RGR-GRPO introduces rubrics as dense, task-specific rewards and offline guidance to promote exploration in multi-domain reasoning for LLMs. By combining rubric-based Factual and Process criteria with an off-policy refinement mechanism (Exploration Assessment, Self-Refinement, and Mix-Policy GRPO), it achieves better performance and training stability than sparse verifiable rewards and other off-policy baselines. Across 14 benchmarks covering mathematics, physics, chemistry, and general reasoning, RGR-GRPO shows consistent average gains and robust generalization, with stable entropy dynamics during training and improved Pass@k results, indicating an expanded reasoning frontier. The work also analyzes the roles of rubric design and self-refinement, showing that rubric-based rewards and offline refinements contribute to both immediate performance and longer-term robustness in complex, multi-domain tasks.

Abstract

Recent advances in reinforcement learning (RL) have significantly improved the complex reasoning capabilities of large language models (LLMs). Despite these successes, existing methods mainly focus on single-domain RL (e.g., mathematics) with verifiable rewards (RLVR), and their reliance on purely online RL frameworks restricts the exploration space, thereby limiting reasoning performance. In this paper, we address these limitations by leveraging rubrics to provide both fine-grained reward signals and offline guidance. We propose RGR-GRPO (Reward and Guidance through Rubrics), a rubric-driven RL framework for multi-domain reasoning. RGR-GRPO enables LLMs to receive dense and informative rewards while exploring a larger solution space during GRPO training. Extensive experiments across 14 benchmarks spanning multiple domains demonstrate that RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance. Compared with the verifiable online RL baseline, RGR-GRPO achieves average improvements of +7.0%, +5.4%, +8.4%, and +6.6% on mathematics, physics, chemistry, and general reasoning tasks, respectively. Notably, RGR-GRPO maintains stable entropy fluctuations during off-policy training and achieves superior pass@k performance, reflecting sustained exploration and an effective breakthrough beyond existing performance bottlenecks.

Paper Structure

This paper contains 41 sections, 1 theorem, 12 equations, 9 figures, 6 tables, 1 algorithm.

Key Result

Theorem 3.1

The Exploration Assessment (EA) mechanism, which conditionally applies off-policy refinement (Section sec:off-policy, Step 2-3) only when the on-policy exploration upper bound is insufficient (i.e., no perfect solution $o_{\text{best}}$ is found), is a necessary component for stabilizing the RGR-GRP
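
The gating idea in the theorem statement can be illustrated with a minimal sketch. It assumes each rollout carries one pass/fail flag per rubric criterion and that a rollout is "perfect" when every criterion passes. The names `Rollout`, `rubric_score`, `exploration_assessment`, and the `refine` callback are hypothetical, and appending the refined rollouts to the on-policy group is an assumption about how the mixing works, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    text: str
    passed: List[bool]        # one flag per rubric criterion (assumed format)
    off_policy: bool = False  # True for refined, off-policy rollouts


def rubric_score(r: Rollout) -> float:
    """Fraction of rubric criteria satisfied; 0.0 if the rubric is empty."""
    return sum(r.passed) / len(r.passed) if r.passed else 0.0


def exploration_assessment(
    group: List[Rollout],
    refine: Callable[[Rollout, List[int]], List[Rollout]],
) -> List[Rollout]:
    """Apply off-policy refinement only when no perfect rollout was found."""
    best = max(group, key=rubric_score)  # o_best
    if best.passed and all(best.passed):
        # On-policy exploration already reached a perfect solution.
        return list(group)
    # Indices of the criteria that o_best failed; these guide refinement.
    failed = [i for i, ok in enumerate(best.passed) if not ok]
    refined = refine(best, failed)
    # Assumption: refined rollouts are added to the group, not substituted.
    return list(group) + refined
```

With a stub `refine` that returns a single corrected rollout, a group whose best member fails one criterion grows by one off-policy rollout, while a group that already contains a perfect rollout is returned unchanged.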

Figures (9)

  • Figure 1: Our RGR-GRPO shows strong cross-domain reasoning capability and expands the frontier of exploration.
  • Figure 2: Overview of the RGR-GRPO framework: (a) Construct rubrics for RL reward based on the input question and reference answer. (b) Conduct exploration assessment with the best response $o_{\text{best}}$, determining whether off-policy guidance is required. When exploration is insufficient, failed criteria are then used to refine $o_{\text{best}}$ into off-policy rollouts, and the sampling probabilities are reshaped via a shaping function to update the policy model.
  • Figure 3: Comparison of out-of-distribution (OOD) performance (%) on the MedMCQA and CS-Bench datasets.
  • Figure 4: Distribution and average of file / function match rate and resolved rate on SWE-Bench Lite LeaderBoard.
  • Figure 5: Pass@k performance (%) of Qwen2.5-7B across physics, chemistry, and math subjects in Sci-Bench.
  • ...and 4 more figures
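
For readers unfamiliar with the Pass@k metric named in Figure 5, the sketch below shows the unbiased estimator commonly used in code-generation evaluation: with $n$ samples per problem of which $c$ are correct, $\text{pass@}k = 1 - \binom{n-c}{k} / \binom{n}{k}$. This page does not state how the paper computes Pass@k, so treating this estimator as the one used is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n samples containing c correct ones, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A per-benchmark figure would then be the mean of `pass_at_k` over all problems, evaluated at each value of `k` of interest.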

Theorems & Definitions (2)

  • Theorem 3.1: Necessity of Exploration Assessment
  • proof