Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu

TL;DR

This work introduces Guru, a 92K-example, six-domain RL corpus designed to study general-purpose reasoning in large language models. Through domain-specific reward design, deduplication, and difficulty-aware filtering, Guru enables controlled cross-domain RL experiments that reveal strong domain-dependent transfer: pretraining-rich domains (Math, Code, Science) benefit from cross-domain RL, while underrepresented domains (Logic, Simulation, Tabular) require in-domain data for meaningful gains. Large-scale experiments with Guru-7B and Guru-32B demonstrate state-of-the-art performance among open RL-trained models on a 17-task, six-domain evaluation suite, with Pass@k analyses illustrating nuanced improvements across tasks and decoding settings. The work emphasizes multi-domain RL as a robust path to general reasoning capabilities and provides open data, models, and code to spur further research in general-purpose RL for reasoning.

Abstract

Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains (Math, Code, Science, Logic, Simulation, and Tabular), each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming the best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, and training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
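
The abstract reports Pass@k improvements without saying how Pass@k is computed. As a minimal sketch, assuming the commonly used unbiased estimator (draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k drawn samples is correct), the metric could be computed as follows. The function name `pass_at_k` and the sample counts are illustrative assumptions, not taken from the paper or its code release.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled responses, c of which are correct.

    Assumption: this is the standard unbiased estimator,
    1 - C(n - c, k) / C(n, k); the paper may compute Pass@k differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # contains at least one correct response.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem: 10 samples, 3 of them correct.
estimate_k1 = pass_at_k(n=10, c=3, k=1)
estimate_k5 = pass_at_k(n=10, c=3, k=5)
```

Averaging this per-problem estimate over a benchmark gives a dataset-level Pass@k. For k = 1 it reduces to the fraction of correct samples, c / n.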

Paper Structure

This paper contains 47 sections, 20 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Left: Absolute accuracy gain (%) over the base model when RL-trained on various domains. Pretraining-heavy domains (Math, Code, Science) benefit from cross-domain training, while others require in-domain data, indicating RL aids skill acquisition. Right: Our Guru-7B/32B models consistently outperform strong open baselines across 17 reasoning tasks when RL-trained with mixed-domain data.
  • Figure 2: Overview of the data curation pipeline of the Guru dataset.
  • Figure 3: Cross-Domain RL Transfer Performance per Task. The heatmap shows the performance gains (accuracy) from RL training on different domains (rows) when evaluated on the test sets of different domains (columns). Warmer colors indicate higher performance gains, computed by applying min-max normalization to the validation accuracies within each column. Accuracy is reported using the checkpoint with the highest average score across tasks. Khaki-colored rectangles mark in-domain evaluations (diagonal); others reflect cross-domain generalization. This highlights differential transferability: Math, Code, and Science benefit significantly from cross-domain transfer, while Logic, Simulation, and Tabular tasks see limited gains, with improvements primarily driven by within-domain training. Notably, naive mixed-domain training, which combines data from all domains, performs on par with or better than single-domain RL.
  • Figure 4: The reward and response length of each domain during RL training with (top row) single-domain data (3k examples each) from Guru-18k and (bottom row) the full Guru-18k mixture dataset. The x-axis is the number of gradient update steps.
  • Figure 5: Pass@k analysis of Guru. (a) Pass@k on AIME24 (Math) and Zebra Puzzle (Logic) for base Qwen2.5-7B/32B vs. RL-tuned Guru-7B/32B. (b) Pass@k under different decoding settings: higher sampling temperature or larger top-p broadens exploration and offsets RL-induced entropy collapse.
  • ...and 4 more figures
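
The Figure 3 caption says the heatmap colors come from min-max normalization of validation accuracies within each column. A minimal sketch of that per-column scaling is shown below, assuming rows are training domains and columns are evaluation tasks. The function name `min_max_by_column`, the zero fallback for constant columns, and the accuracy values are all hypothetical and are not taken from the paper.

```python
def min_max_by_column(rows: list[list[float]]) -> list[list[float]]:
    """Rescale each column of a row-major table to the range [0, 1].

    Assumption: a column whose values are all equal maps to 0.0,
    since the caption does not say how such a column is handled.
    """
    normalized_columns = []
    for column in zip(*rows):  # iterate over columns of the table
        low, high = min(column), max(column)
        span = high - low
        normalized_columns.append(
            [(value - low) / span if span else 0.0 for value in column]
        )
    # Transpose back so the result has the same row-major layout.
    return [list(row) for row in zip(*normalized_columns)]


# Hypothetical accuracies: 3 training domains (rows) x 2 tasks (columns).
accuracies = [
    [10.0, 50.0],
    [20.0, 50.0],
    [30.0, 70.0],
]
heat = min_max_by_column(accuracies)
```

Because each column is scaled independently, the colors compare training domains on a single evaluation task; they do not compare absolute accuracy across tasks.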