
OpenSIR: Open-Ended Self-Improving Reasoner

Wai-Chung Kwan, Joshua Ong Jun Leang, Pavlos Vougiouklis, Jeff Z. Pan, Marco Valentino, Pasquale Minervini

TL;DR

OpenSIR proposes a dual-role, open-ended self-play framework in which a single LLM alternates between generating and solving novel mathematical problems without external supervision. By optimizing for both difficulty (solvability and solution length) and diversity (embedding-based novelty), the method creates an adaptive curriculum that drives continual exploration of new concepts. Empirically, OpenSIR improves multiple instruction-tuned models on GSM8K and College Math and outperforms GRPO baselines trained on thousands of annotated examples, demonstrating open-ended autonomous mathematical reasoning. The work highlights the importance of co-evolving teacher-student dynamics and diversity rewards for long-horizon self-improvement, offering a scalable approach to bootstrapping advanced reasoning skills without labeled data.
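The embedding-based novelty mentioned above can be illustrated with a small sketch. This is not the paper's reward formula, which is not given on this page. It assumes, purely for illustration, that a problem's novelty is one minus its highest cosine similarity to any previously generated problem. The names `cosine_similarity` and `novelty_score` and the toy two-dimensional embeddings are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_score(candidate: Sequence[float],
                  pool: Sequence[Sequence[float]]) -> float:
    """Assumed novelty: 1 minus the max cosine similarity to the existing pool.

    An empty pool makes every candidate maximally novel (score 1.0).
    """
    if not pool:
        return 1.0
    return 1.0 - max(cosine_similarity(candidate, p) for p in pool)


# Toy embeddings: one previously generated problem in the pool.
pool = [[1.0, 0.0]]
near_duplicate = novelty_score([1.0, 0.0], pool)  # same direction as the pool item
new_concept = novelty_score([0.0, 1.0], pool)     # orthogonal to the pool item
```

Under this assumption, a problem whose embedding matches an earlier one scores near zero, while one pointing in an unexplored direction scores near one, which is the kind of signal a diversity reward needs.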

Abstract

Recent advances in large language model (LLM) reasoning through reinforcement learning rely on annotated datasets for verifiable rewards, which may limit models' ability to surpass human-level performance. While self-play offers a promising alternative, existing approaches depend on external verifiers or cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework where an LLM learns to generate and solve novel problems by alternating teacher and student roles without external supervision. To generate novel problems, OpenSIR optimises for both difficulty and diversity, rewarding problems that challenge appropriately while exploring distinct concepts, enabling open-ended mathematical discovery. Starting from a single trivial seed problem, OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on GSM8K, and from 28.8 to 34.4 on College Math, while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K. Our analyses reveal that OpenSIR achieves open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, progressing autonomously from basic to advanced mathematics.

Paper Structure

This paper contains 46 sections, 10 equations, 22 figures, 11 tables, 1 algorithm.

Figures (22)

  • Figure 1: Overview of the framework. A single policy $\pi_\theta$ alternates between generating and solving novel problems without external supervision. Each training iteration consists of problem generation, solution sampling, scoring, and model update. Novelty is captured through both difficulty and diversity: problems must be challenging yet solvable, and they must explore new concepts. Together, these dimensions drive open-ended self-improvement in the LLM's reasoning ability.
  • Figure 2: Evolution of problem difficulty, validity, and topic diversity during training. (Left) Human evaluation results showing difficulty rankings (1-5 scale where 1=easiest, 5=hardest) and number of invalid problems for GSM8K, MATH, and problems generated at steps 0, 100, and 200 of training. Invalid problems are those with logical flaws, missing information, or ambiguities. (Right) Distribution of mathematical topics across training stages, demonstrating the increasing diversity of generated problems from step 0 to step 200.
  • Figure 3: t-SNE visualization of problem embeddings showing the effect of diversity reward on problem distribution. With diversity reward, problems explore broader regions of the embedding space compared to the clustered distribution without diversity reward.
  • Figure 4: An invalid arithmetic question generated at step 0 with a solve rate of 0.25. The question is invalid because the VIP ticket price is not provided, so the minimum regular ticket price cannot be calculated.
  • Figure 5: An invalid arithmetic question generated at step 0 with a solve rate of 0.125. The question is invalid because the two interest rates and principal amounts are not provided, so the percentage difference cannot be calculated from the general formula alone.
  • ...and 17 more figures
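
The solve rates quoted in the captions of Figures 4 and 5 can be read as the fraction of sampled solutions whose final answer matches a reference answer. The sketch below illustrates that reading only. How the reference is obtained is not specified on this page, so it is passed in as a parameter. The function name `solve_rate` and the choice of eight samples are assumptions, the latter made only because 0.125 equals 1/8.

```python
from typing import Sequence


def solve_rate(answers: Sequence[str], reference: str) -> float:
    """Fraction of sampled final answers that equal the reference answer."""
    if not answers:
        raise ValueError("at least one sampled answer is required")
    matches = sum(1 for a in answers if a.strip() == reference.strip())
    return matches / len(answers)


# Hypothetical final answers from eight sampled solutions to one problem.
samples = ["42", "40", "42", "38", "45", "41", "39", "44"]
rate = solve_rate(samples, reference="42")  # 2 of 8 samples match
```

A low solve rate does not by itself indicate a hard problem: both captioned examples have low rates because the questions omit required information, not because the mathematics is difficult.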