Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics

Leheng Sheng; Wenchang Ma; Ruixin Hong; Xiang Wang; An Zhang; Tat-Seng Chua

TL;DR

This work tackles the challenge of directly rewarding chain-of-thought (CoT) reasoning by introducing RLCER, a framework in which a single policy plays two roles, reasoner and rubricator, and generates self-proposed rubrics that supervise its own CoT. Rubrics are evaluated by a verifier and evolve over time because rubric validity is itself rewarded, which enables outcome-free supervision that correlates with final-answer correctness. Empirical results show that RLCER outperforms traditional outcome-centric RLVR across multiple datasets and model sizes, with larger models benefiting most, and that the rubrics serve as useful in-prompt hints that boost inference-time reasoning. The findings suggest a new paradigm in which LLMs autonomously improve not just what they answer but how they think, potentially reducing human labeling needs and improving robustness to distribution shifts.

Abstract

Although chain-of-thought (CoT) plays a crucial role in LLM reasoning, directly rewarding it is difficult: training a reward model (RM) demands heavy human labeling effort, and static RMs struggle with evolving CoT distributions and reward hacking. These challenges motivate us to seek an autonomous CoT rewarding approach that requires no human annotation and can evolve gradually. Inspired by recent self-evolving training methods, we propose RLCER (Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics), which enhances outcome-centric RLVR by rewarding CoTs with self-proposed and self-evolving rubrics. We show that self-proposed and self-evolving rubrics provide reliable CoT supervision signals even without outcome rewards, enabling RLCER to outperform outcome-centric RLVR. Moreover, when used as in-prompt hints, these self-proposed rubrics further improve inference-time performance.

Paper Structure (25 sections, 13 equations, 12 figures, 3 tables, 1 algorithm)

Introduction
Related Works
LLM Reasoning with RL
Self-Evolving in LLMs
Reinforcement Learning with Rubrics
Preliminaries
Incentivizing LLM Reasoning with RL
Multi-Role RL Under a Single Policy Model
Methodology
Key Idea
Two Roles in One Policy: Reasoner and Rubricator
Rewarding How to Think via Self-Proposed Rubrics
Rubrics Self-Evolving for Better Supervision
Two-Role Optimization under a Single Policy
Experiments
...and 10 more sections

Figures (12)

Figure 1: Performance across three math datasets on the 7B model. Training with RLCER leads to a higher performance ceiling, and the self-evolving rubrics further enhance reasoning performance.
Figure 2: Key idea of reinforcement learning with CoT supervision via self-evolving rubrics (RLCER). The policy model $\pi_\theta$ acts as both the reasoner and the rubric generator, self-generating and self-evolving the rubrics used for CoT supervision. The evolution is steered toward rubrics whose satisfaction correlates with final-answer correctness.
Figure 3: The RLCER loop. The format reward is omitted for brevity. A single policy model self-proposes rubrics for rewarding CoTs and self-evolves these rubrics by rewarding its rubric-generation capability.
Figure 4: Illustration of the reward calculation process in RLCER. For question $\mathcal{Q}$, the reasoner generates $N$ responses, each with a CoT $\hat{\mathcal{C}}_k$ and a final answer $\hat{\mathcal{A}}_k$. The rubricator then generates $K_{n}$ specific rubrics (i.e., $\hat{\mathcal{R}}_n \triangleq \{\hat{\tau}_k\}_{k=1}^{K_n}$). The outcome reward is applied first by matching the generated answer $\hat{\mathcal{A}}_k$ with the ground-truth answer $\mathcal{A}$. All valid rubrics (i.e., those with $\mathtt{corr}(\mathbf v_k,\mathbf z)>\alpha$ and $\mathtt{std}(\mathbf v_k)>0$) are collected for rewarding CoTs. The fraction of valid rubrics in the $n$-th rubricator generation (i.e., $\frac{|\{\hat{\tau}^{valid}_{n,k}\}|}{K_n}$) is used to reward the rubricator so that it self-evolves.
Figure 5: Accuracy dynamics when only rewarding CoTs. Even without any outcome reward, rewarding CoTs with self-proposed rubrics brings consistent performance gains.
...and 7 more figures
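
The rubric-validity test in the Figure 4 caption can be illustrated with a minimal sketch. It assumes a reading the caption does not spell out: $\mathbf v_k$ is taken to be the vector of binary verifier judgments for rubric $k$ across the $N$ responses, and $\mathbf z$ the vector of binary answer-correctness labels. The function names, the use of Pearson correlation for $\mathtt{corr}$, and the threshold value are hypothetical choices for illustration, not the paper's implementation. How valid rubrics are aggregated into a CoT reward is not stated here, so the sketch stops at the validity mask and the rubricator's valid-fraction reward.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def valid_rubric_mask(v: List[List[int]], z: List[int], alpha: float) -> List[bool]:
    """Mark rubric k valid when std(v_k) > 0 and corr(v_k, z) > alpha.

    v[k][i] is 1 if response i satisfies rubric k (assumed binary verifier output);
    z[i] is 1 if the final answer of response i is correct.
    """
    mask = []
    for v_k in v:
        non_constant = len(set(v_k)) > 1  # same condition as std(v_k) > 0
        mask.append(non_constant and pearson(v_k, z) > alpha)
    return mask


def rubricator_reward(mask: List[bool]) -> float:
    """Fraction of valid rubrics in one rubricator generation."""
    return sum(mask) / len(mask) if mask else 0.0


# Toy example: 4 responses, 3 proposed rubrics, hypothetical alpha = 0.2.
z = [1, 1, 0, 0]
v = [
    [1, 1, 0, 0],  # tracks correctness exactly
    [1, 1, 1, 1],  # satisfied by every response, so it cannot discriminate
    [0, 1, 1, 0],  # uncorrelated with correctness
]
mask = valid_rubric_mask(v, z, alpha=0.2)
print(mask, rubricator_reward(mask))
```

In this toy case only the first rubric passes both conditions: the second is satisfied by every response and so has zero standard deviation, and the third has zero correlation with correctness.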
