Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models

Joykirat Singh, Tanmoy Chakraborty, Akshay Nambi

TL;DR

SPHERE addresses error propagation and data scarcity in multi-step mathematical reasoning by introducing a fully automated, three-stage self-evolving data generation pipeline for small language models. It uses a pruned MCTS-based data generator, guided by a Process Reward Model and an Outcome-Supervised Reward, to create high-quality pairs of correct and flawed reasoning chains, which the model then learns from via Direct Preference Optimization. The approach yields substantial improvements across benchmark datasets (MATH 500, GSM8K, AIME, AMC, Olympiad), outperforming the base models and matching or exceeding GPT-4o on several tasks. This work demonstrates a scalable, annotation-free route to closing the reasoning gap for small models.

Abstract

Large language models (LLMs) have significantly improved their reasoning capabilities; however, they still struggle with complex multi-step mathematical problem-solving due to error propagation, lack of self-correction, and limited adaptability to diverse reasoning styles. Existing methods rely on static fine-tuning or prompt engineering, which fail to generalize across problem complexities, while the scarcity of high-quality preference data further hinders reliable reasoning. We introduce SPHERE, a self-evolving data generation pipeline that enhances reasoning in small language models (SLMs) by iteratively generating, correcting, and diversifying reasoning chains. SPHERE operates in three stages: (i) Self-Generation, where the model autonomously constructs problem-solving steps; (ii) Self-Correction, enabling it to identify and rectify errors; and (iii) Diversity Induction, improving robustness through multiple valid reasoning trajectories. This self-evolution mechanism strengthens mathematical reasoning and enhances model reliability. Evaluations on MATH 500, GSM8K, AIME, AMC, and Olympiad show that SPHERE-trained models achieve significant gains over their base versions and match/surpass GPT-4o on certain benchmarks. Our findings demonstrate that self-evolving models can close the reasoning gap between SLMs and state-of-the-art LLMs, making mathematical AI more reliable, scalable, and efficient.
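
As a rough illustration of how preference pairs of correct and flawed reasoning are typically consumed by Direct Preference Optimization, the sketch below computes the standard DPO loss for a single pair from sequence log-probabilities. It is a minimal sketch, not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example, and the paper's actual hyperparameters are not given in this section.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each full reasoning chain
    under the trained policy and under a frozen reference model.
    beta=0.1 is an illustrative default, not a value taken from the paper.
    """
    # How much more the policy favors each chain than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a numerically stable form for either sign of the margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree, the margin is zero and the loss is log 2; the loss falls below that value as the policy raises the relative likelihood of the correct chain over the flawed one.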

Paper Structure

This paper contains 37 sections, 5 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Illustration of Pruned MCTS Rollouts.
  • Figure 2: Illustration of all stages in Pruned MCTS. C and IC denote $Sol_{max}$ (correct solution) and $Sol_{min}$ (incorrect solution) extracted from each rollout. Reasoning pairs within the Gold Box are selected for preference learning.
  • Figure 3: Stage I, rollout with question: There were 61 parents in the program and some pupils too. The program could seat 44 people. There were 238 people present in the program. How many pupils were present in the program?
  • Figure 4: Stage II, rollout with question: There were 61 parents in the program and some pupils too. The program could seat 44 people. There were 238 people present in the program. How many pupils were present in the program? and an incorrect solution. The model is prompted to identify and rectify any mistakes in the reasoning chain.
  • Figure 5: Stage I, rollout with question: If the Great Pyramid of Giza is 20 feet taller than a structure that is 500 feet tall and 234 feet wider than its height, what is the total sum of its height and width in feet?
  • ...and 1 more figures
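
The Figure 2 caption describes extracting a correct solution ($Sol_{max}$) and an incorrect solution ($Sol_{min}$) from each rollout to form a preference pair. One way this could look in code is sketched below. The `Solution` fields, the `select_pair` helper, and the rule of taking the highest-reward correct chain and the lowest-reward incorrect chain are assumptions suggested by the naming, not details confirmed by this section.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Solution:
    steps: List[str]   # reasoning chain, one entry per step
    reward: float      # hypothetical aggregate process-reward score
    is_correct: bool   # hypothetical outcome check against the reference answer


def select_pair(candidates: List[Solution]) -> Optional[Tuple[Solution, Solution]]:
    """Return (sol_max, sol_min) for one rollout, or None if no pair exists.

    Assumed rule: sol_max is the correct chain with the highest reward,
    sol_min is the incorrect chain with the lowest reward.
    """
    correct = [s for s in candidates if s.is_correct]
    incorrect = [s for s in candidates if not s.is_correct]
    if not correct or not incorrect:
        # A rollout with only correct or only incorrect chains yields no pair.
        return None
    sol_max = max(correct, key=lambda s: s.reward)
    sol_min = min(incorrect, key=lambda s: s.reward)
    return sol_max, sol_min


# Illustrative rollout for the Figure 3 question (scores are made up).
rollout = [
    Solution(["238 - 61 = 177 pupils"], reward=0.9, is_correct=True),
    Solution(["238 - 61 - 44 = 133 pupils"], reward=0.2, is_correct=False),
    Solution(["238 + 61 = 299 pupils"], reward=0.1, is_correct=False),
]
pair = select_pair(rollout)
```

Returning `None` when a rollout lacks either a correct or an incorrect chain keeps unusable rollouts out of the preference dataset instead of forcing a degenerate pair.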