SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data

Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, Dacheng Tao

TL;DR

SeRL is a data-efficient RL framework for LLMs in data-scarce domains. It couples self-instruction (with online filtering for quality, diversity, and difficulty) with majority-vote-based self-rewarding, which needs no external labels. The method runs iterative unsupervised RL over the generated data and reaches performance competitive with or superior to baselines trained on high-quality rewards, across math-centric benchmarks and multiple backbones. Key contributions include a dual-end difficulty filter that prevents reward hacking, and empirical evidence that majority-vote rewards align closely with rule-based rewards while keeping training stable. The approach is practical for domains where labeled data are expensive or unavailable, and code is released for reproducibility.

Abstract

Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works inevitably rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. Specifically, SeRL comprises two complementary modules: self-instruction and self-rewarding. The former module generates additional instructions based on the available data at each training step, employing robust online filtering strategies to ensure instruction quality, diversity, and difficulty. The latter module introduces a simple yet effective majority-voting mechanism to estimate response rewards for additional instructions, eliminating the need for external annotations. Finally, SeRL performs conventional RL based on the generated data, facilitating iterative self-play learning. Extensive experiments on various reasoning benchmarks and across different LLM backbones demonstrate that the proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards. Our code is available at https://github.com/wantbook-book/SeRL.
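
The majority-voting reward described above can be illustrated with a minimal sketch. It is not taken from the released code: the function names, the binary 1.0/0.0 reward values, and the `low`/`high` thresholds are hypothetical, and reading the dual-end difficulty filter as a band on the agreement rate is an assumption made here for illustration. The sketch assumes each sampled response has already been reduced to a final answer string.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Reward each sampled answer by agreement with the most common answer.

    `answers` holds the final answers extracted from several responses
    sampled for one instruction (the extraction step is assumed, not shown).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Ties are broken in favour of the answer seen first.
    majority_answer, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == majority_answer else 0.0 for a in answers]
    return majority_answer, rewards


def within_difficulty_band(rewards, low=0.2, high=0.8):
    """Keep an instruction only if its agreement rate is neither too low nor too high.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    agreement = sum(rewards) / len(rewards)
    return low <= agreement <= high


if __name__ == "__main__":
    label, rewards = majority_vote_rewards(["42", "42", "41", "42"])
    print(label, rewards, within_difficulty_band(rewards))
```

Under this reading, an instruction on which every sample agrees is dropped as too easy, and one with almost no agreement is dropped because its pseudo-label is unreliable.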

Paper Structure

This paper contains 27 sections, 8 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: An overview of the proposed SeRL framework, which comprises two core components: (1) Self-Instruction, where the model generates new instructions from a small initial dataset and applies a robust online filtering strategy to ensure instruction quality, diversity, and appropriate difficulty. (2) Self-Rewarding, where the model performs unsupervised RL training using a majority-vote reward mechanism without relying on verifiable labels.
  • Figure 2: Training curve of LLaMA-3.2-3B-Instruct without the online filter strategy.
  • Figure 3: Cosine similarity between the rule-based reward and the majority-vote-based / model-based reward for each instruction on MATH500. Each point represents the similarity over 16 sampled responses per instruction. "sim(GT, Model-based)" is the cosine similarity between the rule-based reward and the model-based reward, while "sim(GT, Majority-vote-based)" is the cosine similarity between the rule-based reward and our majority-vote-based reward. The dashed lines in the figure indicate the average values.
  • Figure 4: Comparison of difficulty and diversity of $\mathcal{D}^1_{\text{gen}}$ before and after applying the difficulty filtering. The y-axis label # Instructions represents the number of instructions.
  • Figure 5: Analysis of generated instructions difficulty and diversity.
  • ...and 2 more figures
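
The per-instruction similarity plotted in Figure 3 can be illustrated with a short sketch. It assumes each instruction yields two reward vectors of equal length (one entry per sampled response, 16 in the figure), and it returns 0.0 when either vector is all zeros; that convention and the function name are assumptions made here, not details from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two reward vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("reward vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: similarity is undefined for a zero vector, so report 0.0.
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    rule_based = [1.0, 1.0, 0.0, 0.0]     # hypothetical rule-based rewards
    majority_vote = [1.0, 1.0, 1.0, 0.0]  # hypothetical majority-vote rewards
    print(cosine_similarity(rule_based, majority_vote))
```

A value near 1 means the majority-vote reward marks the same responses as correct as the rule-based reward does for that instruction.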