Training AI Co-Scientists Using Rubric Rewards

Shashwat Goel, Rishi Hazra, Dulhan Jayalath, Timon Willi, Parag Jain, William F. Shen, Ilias Leontiadis, Francesco Barbieri, Yoram Bachrach, Jonas Geiping, Chenxi Whitehouse

TL;DR

The paper introduces an automated, rubric-driven RL approach to train AI co-scientists to generate research plans by mining goals and goal-specific rubrics from papers. It uses a self-grading loop where a frozen policy grader evaluates plans against extracted rubrics, enabling generator-verifier improvements without ongoing human supervision. Across ML, ArXiv, and Medical domains, finetuned models show consistent improvements and cross-domain generalization, with human experts preferring the finetuned plans for a majority of goals. The work demonstrates scalable data collection, automated evaluation, and cross-domain applicability, moving toward general AI co-scientists that can assist researchers across fields.

Abstract

AI co-scientists are emerging as a tool to assist human researchers in achieving their research goals. A crucial feature of these AI co-scientists is the ability to generate a research plan given a set of aims and constraints. The plan may be used by researchers for brainstorming, or may even be implemented after further refinement. However, language models currently struggle to generate research plans that follow all constraints and implicit requirements. In this work, we study how to leverage the vast corpus of existing research papers to train language models that generate better research plans. We build a scalable, diverse training corpus by automatically extracting research goals and goal-specific grading rubrics from papers across several domains. We then train models for research plan generation via reinforcement learning with self-grading. A frozen copy of the initial policy acts as the grader during training, with the rubrics creating a generator-verifier gap that enables improvements without external human supervision. To validate this approach, we conduct a study with human experts for machine learning research goals, spanning 225 hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over the initial model for 70% of research goals, and approve 84% of the automatically extracted goal-specific grading rubrics. To assess generality, we also extend our approach to research goals from medical papers, and new arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Together, these findings demonstrate the potential of a scalable, automated training recipe as a step towards improving general AI co-scientists.
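The reward described above can be summarized as a simple computation: a frozen grader model checks each goal-specific rubric item against the generated plan (with reference to a set of general guidelines), and the training reward is the fraction of items marked satisfied. The sketch below illustrates that computation; the function and variable names (`rubric_reward`, `grade_item`, `GENERAL_GUIDELINES`) are illustrative assumptions, not identifiers from the paper's code, and `grade_item` stands in for a call to the frozen grader model.

```python
from typing import Callable, List

# The paper describes seven general guidelines checked per rubric item;
# these entries are placeholders, not the paper's actual guideline text.
GENERAL_GUIDELINES: List[str] = [
    "the plan addresses the rubric item directly",
    "the relevant part of the plan violates no general guideline",
]

def rubric_reward(
    plan: str,
    rubric: List[str],
    grade_item: Callable[[str, str, List[str]], bool],
) -> float:
    """Fraction of rubric items the (frozen) grader marks as satisfied.

    `grade_item(plan, item, guidelines)` abstracts the grader model's
    judgment: True iff the plan satisfies `item` under all guidelines.
    """
    if not rubric:
        return 0.0
    satisfied = sum(grade_item(plan, item, GENERAL_GUIDELINES) for item in rubric)
    return satisfied / len(rubric)
```

In the paper this score doubles as both the RL training reward and the automatic evaluation metric; only the grader differs (a frozen copy of the initial policy during training, a jury of frontier models at evaluation time).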

Paper Structure

This paper contains 84 sections, 14 figures, and 15 tables.

Figures (14)

  • Figure 1: Summary of methodology. (Bottom) We train models to generate research plans for a given research goal. We obtain rewards for RL by using the initial model to grade generated plans with the help of rubrics. (Left) To collect training data, we use a sample creator model (Llama-4-Maverick) to extract up to three samples per research paper, each including a research goal, goal-specific grading rubric, and reference solution. For each of these components, we provide guidelines to a sample selector model (Claude-4-Sonnet) that picks one best sample per paper for our use. (Right) During grading, the goal-specific rubrics are used alongside a list of seven general guidelines that are checked for the part of the plan relevant to each rubric item. Rubric items that meet all guidelines are marked satisfied, and the fraction of satisfied rubric items is used as part of the training reward and evaluation scores.
  • Figure 2: A sample from our Dataset-ML test set, automatically extracted from a published ML paper. When evaluating a proposed plan for the research goal (top-left), for each goal-specific rubric item (bottom), the grading model reasons about the part of the plan addressing that item and checks for violations of general guidelines (top-right). Some rubric items test constraints explicitly stated in the research goal, while others check for implicit features or requirements.
  • Figure 3: Research plan generation scores of models across research goals extracted from ML, medical, and arXiv papers. We report average scores on the full test sets across rubric grading by three frontier models, and use bootstrap sampling for error bars. In (a)-(c), we observe that our domain finetuned models always improve over the initial policy (Qwen-3-30B-A3B-Instruct). We also find that GPT models consistently outperform the rest, while within model families, more recent and larger models perform better, as expected. In (d), we observe that our finetuning also leads to significant cross-domain generalization. For example, the medical finetune improves significantly on ML and arXiv research goals.
  • Figure 5: Finetuning improvements based on rubric grading across a jury of three judges (GPT-5-Thinking, Claude-4-Sonnet, and Gemini-2.5-Pro).
  • Figure 6: Training both the Qwen-3-4B Instruct and Thinking models with the Qwen-3-30B MoE reward model $\theta_r$. Results are on the validation set; the grader used for scoring is Claude-4-Sonnet. We see very similar performance between Instruct and Thinking, even though Thinking requires more than 2x the compute for the same number of training steps.
  • ...and 9 more figures