Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics
Leheng Sheng, Wenchang Ma, Ruixin Hong, Xiang Wang, An Zhang, Tat-Seng Chua
TL;DR
This work tackles the challenge of directly rewarding chain-of-thought (CoT) reasoning by introducing RLCER, a framework in which a single policy plays two roles, reasoner and rubricator, and generates self-proposed rubrics that supervise its CoT. Rubrics are evaluated by a verifier and evolve over time through rewards for rubric validity, enabling outcome-free supervision that correlates with final-answer correctness. Empirical results show that RLCER outperforms traditional outcome-centric RLVR across multiple datasets and model sizes, with larger models benefiting most, and that the rubrics, when used as in-prompt hints, further boost inference-time reasoning. The findings suggest a new paradigm in which LLMs autonomously improve not just what they answer but how they think, potentially reducing human labeling needs and improving robustness to distribution shifts.
Abstract
Although chain-of-thought (CoT) plays a crucial role in LLM reasoning, directly rewarding it is difficult: training a reward model (RM) demands heavy human labeling effort, and static RMs struggle with evolving CoT distributions and reward hacking. These challenges motivate us to seek an autonomous CoT rewarding approach that requires no human annotation and can evolve gradually. Inspired by recent self-evolving training methods, we propose RLCER (Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics), which enhances outcome-centric RLVR by rewarding CoTs with self-proposed and self-evolving rubrics. We show that self-proposed and self-evolving rubrics provide reliable CoT supervision signals even without outcome rewards, enabling RLCER to outperform outcome-centric RLVR. Moreover, when used as in-prompt hints, these self-proposed rubrics further improve inference-time performance.
