Improving Data and Reward Design for Scientific Reasoning in Large Language Models
Zijie Chen, Zhenghao Lin, Xiao Liu, Zhenzhong Lan, Yeyun Gong, Peng Cheng
TL;DR
This work tackles two bottlenecks in open-ended scientific reasoning for large language models: data design and reward design. It introduces the Dr. SCI dataset (1,006,701 questions across eight STEM fields) with explicit verifiable and open-ended splits, plus fine-grained rubrics and difficulty annotations to support reliable post-training. It then presents a unified post-training pipeline with three components (Exploration-Expanding SFT, Dynamic Difficulty Curriculum, and SciRubric-Guided RL) that expands reasoning patterns, matches training difficulty to model capability, and provides stable, rubric-informed rewards for open-ended tasks. Experiments with a 4B backbone show substantial gains over strong post-trained baselines on GPQA benchmarks, including the open-ended GPQA-general, underscoring the practical impact of principled data processing and rubric-based RL for scientific reasoning.
Abstract
Solving open-ended science questions remains challenging for large language models, particularly because supervision and evaluation are inherently unreliable. The bottleneck lies in data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained with the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.
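To make the idea of "rubric-based evaluation with explicit answer correctness" concrete, the following is a minimal illustrative sketch, not the paper's actual reward. The abstract does not specify how rubric verdicts and answer correctness are combined, so the linear blend, the `correctness_weight` parameter, and the names `RubricItem` and `rubric_reward` are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RubricItem:
    """One rubric criterion with a judge verdict (hypothetical structure)."""
    description: str  # what the grader checks for
    weight: float     # relative importance; illustrative, not from the paper
    satisfied: bool   # verdict for the response being scored


def rubric_reward(
    items: Sequence[RubricItem],
    answer_correct: bool,
    correctness_weight: float = 0.5,  # assumed value, not from the paper
) -> float:
    """Blend a weighted rubric score with an explicit correctness term.

    Returns a scalar in [0, 1]. The linear blend is an assumption; the
    source only states that rubrics and answer correctness are both used.
    """
    if not 0.0 <= correctness_weight <= 1.0:
        raise ValueError("correctness_weight must lie in [0, 1]")

    total = sum(item.weight for item in items)
    # Fraction of rubric weight that the response satisfies.
    rubric_score = (
        sum(item.weight for item in items if item.satisfied) / total
        if total > 0
        else 0.0
    )
    return (
        correctness_weight * float(answer_correct)
        + (1.0 - correctness_weight) * rubric_score
    )


if __name__ == "__main__":
    example = [
        RubricItem("States the governing equation", 2.0, True),
        RubricItem("Justifies the approximation used", 1.0, False),
        RubricItem("Reports the result with units", 1.0, True),
    ]
    print(rubric_reward(example, answer_correct=True))
```

Keeping the correctness term separate from the rubric score means a response that satisfies stylistic criteria but reaches a wrong final answer cannot earn full reward, which is one plausible reading of why the abstract pairs the two signals.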
