SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM

Xiaojiang Zhang; Jinghui Wang; Zifei Cheng; Wenhao Zhuang; Zheng Lin; Minglei Zhang; Shaojie Wang; Yinghan Cui; Chao Wang; Junyi Peng; Shimiao Jiang; Shiqi Kuang; Shouyu Yin; Chaohang Wen; Haotian Zhang; Bin Chen; Bing Yu

TL;DR

SRPO tackles cross-domain reinforcement learning for large language models by addressing limitations of GRPO with a two-stage training regime and history resampling. Training first builds strong mathematical reasoning, then adds coding skills on top of that reasoning foundation, supported by a data curation pipeline and a rule-based reward design. Empirically, SRPO surpasses DeepSeek-R1-Zero-32B on AIME24 and LiveCodeBench while using roughly one-tenth of its training steps, indicating improved training efficiency and cross-domain transfer. The work also examines how thinking patterns and problem-solving strategies evolve during large-scale RL on math and code tasks.

Abstract

Recent advances in reasoning models, exemplified by OpenAI's o1 and DeepSeek's R1, highlight the significant potential of Reinforcement Learning (RL) to enhance the reasoning capabilities of Large Language Models (LLMs). However, replicating these advancements across diverse domains remains challenging due to limited methodological transparency. In this work, we present two-Staged history-Resampling Policy Optimization (SRPO), which surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks. SRPO achieves this with the same base model as DeepSeek (i.e., Qwen2.5-32B) and only about 1/10 of the training steps required by DeepSeek-R1-Zero-32B, demonstrating superior efficiency. Building upon Group Relative Policy Optimization (GRPO), we introduce two key methodological innovations: (1) a two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency, and (2) History Resampling (HR), a technique to address ineffective samples. Our comprehensive experiments validate the effectiveness of our approach, offering valuable insights into scaling LLM reasoning capabilities across diverse tasks.
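To make the notion of "ineffective samples" concrete, the sketch below shows why a GRPO rollout group can carry no learning signal. GRPO standardizes each reward against the other rollouts for the same prompt, so a group whose rewards are all identical produces all-zero advantages. The abstract does not spell out the exact resampling criterion, so the filter here (drop every prompt whose most recent group had uniform rewards) is an assumption for illustration only, and the names `group_relative_advantages`, `is_ineffective`, and `resample_prompts` are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one rollout group.

    Population standard deviation is used here; implementations differ on
    population vs. sample std, so treat that choice as an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def is_ineffective(rewards):
    """A group with identical rewards yields all-zero advantages (no gradient)."""
    return max(rewards) == min(rewards)


def resample_prompts(history):
    """Hypothetical filter: keep prompts whose last rollout group had mixed rewards.

    `history` maps a prompt id to the list of rewards from its latest group.
    The paper's actual History Resampling rule may differ from this sketch.
    """
    return [pid for pid, rewards in history.items() if not is_ineffective(rewards)]


history = {
    "p1": [1.0, 1.0, 1.0, 1.0],  # every rollout correct: zero advantages
    "p2": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: informative
    "p3": [0.0, 0.0, 0.0, 0.0],  # every rollout wrong: zero advantages
}
kept = resample_prompts(history)
```

In this toy history only `p2` survives the filter. Whether all-incorrect prompts such as `p3` should be kept for later epochs (they may become solvable as the policy improves) is a design choice this sketch does not settle.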
