Improving Data and Reward Design for Scientific Reasoning in Large Language Models

Zijie Chen; Zhenghao Lin; Xiao Liu; Zhenzhong Lan; Yeyun Gong; Peng Cheng

TL;DR

This work tackles two bottlenecks in open-ended scientific reasoning for large language models: data design and reward design. It introduces the Dr. SCI dataset (1,006,701 questions across eight STEM fields) with explicit verifiable and open-ended splits, plus fine-grained rubrics and difficulty annotations to support reliable post-training. It then presents a unified post-training pipeline with three components (Exploration-Expanding SFT, Dynamic Difficulty Curriculum, and SciRubric-Guided RL) that expands reasoning patterns, matches training difficulty to capability, and provides stable, rubric-informed rewards for open-ended tasks. Experiments with a 4B backbone (Qwen3-4B-Base) show substantial gains over strong post-trained baselines on GPQA benchmarks, including the open-ended GPQA-General, underscoring the practical impact of principled data processing and rubric-based RL for scientific reasoning.

Abstract

Solving open-ended science questions remains challenging for large language models, particularly because supervision and evaluation are inherently unreliable. The bottleneck lies in the data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained with the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.
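To make components (ii) and (iii) more concrete, here is a minimal illustrative sketch, not the authors' implementation. The abstract does not give the reward formula or the curriculum rule, so everything below is an assumption: the `RubricItem` fields, the linear blend in `rubric_reward`, the `correctness_weight` parameter, and the pass-rate band in `select_by_difficulty` are all hypothetical names and choices introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    description: str  # what a grader would check for (hypothetical field)
    weight: float     # relative importance of this item (hypothetical field)
    satisfied: bool   # verdict from an external grader, assumed to be given


def rubric_reward(items, answer_correct, correctness_weight=0.5):
    """Blend a weighted rubric score with an explicit correctness signal.

    Assumption: a simple linear blend. The paper's actual formula is not
    stated in the abstract.
    """
    total = sum(item.weight for item in items)
    if total > 0:
        rubric_score = sum(i.weight for i in items if i.satisfied) / total
    else:
        rubric_score = 0.0
    return (correctness_weight * float(answer_correct)
            + (1.0 - correctness_weight) * rubric_score)


def select_by_difficulty(pass_rates, low=0.2, high=0.8):
    """Keep question ids whose current pass rate lies in [low, high].

    Assumption: difficulty is proxied by the model's empirical pass rate,
    re-estimated as training progresses.
    """
    return [qid for qid, rate in pass_rates.items() if low <= rate <= high]


items = [
    RubricItem("states the governing law", 2.0, True),
    RubricItem("reports correct units", 1.0, False),
    RubricItem("justifies the approximation", 1.0, True),
]
reward = rubric_reward(items, answer_correct=True)
kept = select_by_difficulty({"q1": 0.0, "q2": 0.5, "q3": 1.0, "q4": 0.2})
```

In this sketch, questions the model always or never solves are dropped from the next training round, and an open-ended answer earns partial credit from the rubric even when the final answer is wrong. Both behaviors are design choices of the example rather than claims about Dr. SCI.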

Paper Structure (31 sections, 1 equation, 5 figures, 8 tables, 1 algorithm)

Introduction
Dr. SCI Dataset
Data Collection
Data Processing Pipeline
Dataset Statistics
Dr. SCI Post Training
Exploration-Expanding SFT
Dynamic Difficulty Curriculum
SciRubric-Guided RL
Experiments
Implementation Details
Evaluations
Baselines
Experiment Results
Analysis
...and 16 more sections

Figures (5)

Figure 1: Model performance on core scientific reasoning benchmarks. Dr. SCI surpasses strong baselines such as o1-mini and GPT-4o.
Figure 2: Subject distribution of the Dr. SCI dataset.
Figure 3: Length distribution of the Dr. SCI dataset.
Figure 4: Difficulty distribution of the Dr. SCI dataset.
Figure 5: Dynamics and performance of the dynamic difficulty curriculum. (a) Our curriculum dynamically adjusts the average difficulty of training data according to current model capabilities. (b) This yields steady performance growth in scientific reasoning.
