Reinforcement Learning with Rubric Anchors

Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, Xijun Gu, Peiyi Tu, Jiaxin Liu, Wenyu Chen, Yuzhuo Fu, Zhiting Fan, Yanmei Gu, Yuanyuan Wang, Zhengkai Yang, Jianguo Li, Junbo Zhao

TL;DR

Open-ended tasks are hard for RLVR, which depends on automatically verifiable rewards. Rubicon introduces rubric-anchored RL with a large, multi-dimensional rubric bank and a two-stage RL pipeline to improve instruction-following and creative tasks while preserving general ability. It employs adaptive defenses against reward hacking and reports a 5.2% average gain on open-ended benchmarks (especially humanities) with just over 5K samples, outperforming the much larger 671B DeepSeek-V3 by 2.4%. The work broadens RLVR applicability to subjective and stylistic outputs, highlights the importance of rubric diversity and data curation, and identifies challenges and future directions for scalable rubric-based RL.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals, such as passing unit tests in code generation or matching correct answers in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, where carefully designed rubrics serve as structured, model-interpretable criteria for automatic scoring of subjective outputs. We construct, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics from humans, LLMs, or a hybrid human-LLM collaboration. Implementing rubric-based RL is challenging; we address the challenges with a clear framework and present an open-sourced Qwen-30B-A3B model with notable gains: 1) With only 5K+ samples, our system improves by +5.2% on open-ended benchmarks (especially humanities), outperforming a 671B DeepSeek-V3 model by +2.4%, while preserving general and reasoning abilities. 2) Our method provides fine-grained stylistic control, using rubrics as anchors to mitigate the "AI-like" tone and produce more human-like, expressive responses. We share key lessons in rubric construction, data selection, and training, and discuss limitations and future releases.
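
To make the idea of a rubric-based reward concrete, the following is a minimal sketch of how several rubric scores might be combined into one scalar reward. It is not the paper's implementation: the `Rubric` class, the `rubric_reward` function, the weighted-average aggregation, and the two toy scoring functions are all assumptions introduced for illustration. In a real system each scoring function would presumably be a call to an LLM judge rather than a string check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rubric:
    """Hypothetical rubric: a named criterion with a weight and a scorer."""
    name: str
    weight: float
    # Returns a score in [0, 1]. Stand-in for an LLM-judge call (assumption).
    score_fn: Callable[[str, str], float]


def rubric_reward(prompt: str, response: str, rubrics: List[Rubric]) -> float:
    """Weighted average of per-rubric scores, each clipped to [0, 1].

    The weighted average is an assumed aggregation rule, chosen for simplicity.
    """
    total_weight = sum(r.weight for r in rubrics)
    if total_weight <= 0:
        raise ValueError("rubrics must have positive total weight")
    weighted = sum(
        r.weight * min(1.0, max(0.0, r.score_fn(prompt, response)))
        for r in rubrics
    )
    return weighted / total_weight


# Toy scorers used only to exercise the aggregation logic.
def is_non_empty(prompt: str, response: str) -> float:
    return 1.0 if response.strip() else 0.0


def avoids_ai_opening(prompt: str, response: str) -> float:
    return 0.0 if response.lower().startswith("as an ai") else 1.0


rubrics = [
    Rubric("non_empty", 1.0, is_non_empty),
    Rubric("plain_tone", 3.0, avoids_ai_opening),
]

stiff = rubric_reward("Write a toast.", "As an AI, I cannot drink.", rubrics)
natural = rubric_reward("Write a toast.", "Here's to old friends.", rubrics)
```

With these toy rubrics, the response that opens with "As an AI" satisfies only the lower-weighted criterion and so receives a lower reward than the natural one. The resulting scalar could then stand in for the verifiable reward in an RL loop.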

Paper Structure

This paper contains 29 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: An overview of our rubric system. The Data Collection phase (left, orange) begins with an initial Rubric Design that produces a tagging & scoring workflow, which filters a large corpus into high-quality Offline Filter Data. This data then seeds the Rubric Updating phase (right, green), where an RL with rubrics loop not only validates RL Data but also provides feedback to iteratively update the rubric itself. This iterative process ensures that the Final Data is tightly aligned with a continuously improving, model-verifiable evaluation standard.
  • Figure 2: The gray point represents the baseline model, Qwen3-30B-A3B. The orange markers indicate the model RL-trained on creativity tasks only, while the green markers indicate the model RL-trained on instruction-following tasks only. The vertical axis denotes task categories, and the horizontal axis shows model performance on the corresponding tasks.