Table of Contents
Fetching ...

MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy

Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang, Fei Tan

TL;DR

This work proposes MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning, which exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities.

Abstract

Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming human-written templates, limiting both diversity and scalability. We propose MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. Rather than modifying existing problems, MathSmith constructs new ones from scratch by randomly sampling concept-explanation pairs from PlanetMath, ensuring data independence and avoiding contamination. To increase difficulty, we design nine predefined strategies as soft constraints during rationales. We further adopts reinforcement learning to jointly optimize structural validity, reasoning complexity, and answer consistency. The length of the reasoning trace generated under autoregressive prompting is used to reflect cognitive complexity, encouraging the creation of more demanding problems aligned with long-chain-of-thought reasoning. Experiments across five benchmarks, categorized as easy & medium (GSM8K, MATH-500) and hard (AIME2024, AIME2025, OlympiadBench), show that MathSmith consistently outperforms existing baselines under both short and long CoT settings. Additionally, a weakness-focused variant generation module enables targeted improvement on specific concepts. Overall, MathSmith exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities. Our code and data are available at https://github.com/Jasaxion/MathSmith.

MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy

TL;DR

This work proposes MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning, which exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities.

Abstract

Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming human-written templates, limiting both diversity and scalability. We propose MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. Rather than modifying existing problems, MathSmith constructs new ones from scratch by randomly sampling concept-explanation pairs from PlanetMath, ensuring data independence and avoiding contamination. To increase difficulty, we design nine predefined strategies as soft constraints during rationales. We further adopts reinforcement learning to jointly optimize structural validity, reasoning complexity, and answer consistency. The length of the reasoning trace generated under autoregressive prompting is used to reflect cognitive complexity, encouraging the creation of more demanding problems aligned with long-chain-of-thought reasoning. Experiments across five benchmarks, categorized as easy & medium (GSM8K, MATH-500) and hard (AIME2024, AIME2025, OlympiadBench), show that MathSmith consistently outperforms existing baselines under both short and long CoT settings. Additionally, a weakness-focused variant generation module enables targeted improvement on specific concepts. Overall, MathSmith exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities. Our code and data are available at https://github.com/Jasaxion/MathSmith.

Paper Structure

This paper contains 33 sections, 13 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: The MathSmith workflow, comprising the phases of concept and explanation collection, supervised fine-tuning, and reinforcement learning.
  • Figure 2: Average reasoning trace token length under thinking mode (Qwen3-30B-A3B) across various open-source math datasets. Problems synthesized by MathSmith variants elicit significantly longer reasoning, reflecting higher complexity.
  • Figure 3: The MathSmith weakness-focused improvement pipeline. Problems are traced to concept explanations, which serve as a basis for generating targeted variants to strengthen model weaknesses.
  • Figure 4: Performance on the Olympiad benchmark under varying training data volumes (50K–200K). MathSmith-HC scales more effectively than baseline methods.
  • Figure 5: Accuracy on the Olympiad benchmark across Qwen3 model series. MathSmith-HC yields greater gains with larger models.
  • ...and 4 more figures