Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Dongrui Liu, Yi R. Fung

TL;DR

This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments.

Abstract

As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at https://github.com/TarferSoul/Code2Math.

Paper Structure

This paper contains 22 sections, 11 equations, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: Example of code-driven problem evolution. The agent analyzes the seed problem and performs computational exploration to enumerate valid configurations under structural constraints. The empirical findings are then abstracted into an evolved problem with increased combinatorial and structural complexity.
  • Figure 2: Overview of our multi-agent system. Our pipeline consists of three components: the Evolution Agent, the Solvability Verification Agent, and the Difficulty Verification Agent. It is equipped with code tools related to mathematics. The framework takes an original problem and its solution as input and outputs a validated new problem along with a solution for reference.
  • Figure 3: Distribution of Average Token Consumption (ATC) across original and agent-evolved problems. For each problem, we compute the average number of output tokens across all solver models. Timeout samples (where solvers failed to produce output) are assigned the maximum token limit to reflect their high difficulty.
  • Figure 4: Efficiency Analysis of Agentic Problem Evolution. We visualize the distribution of failure counts encountered during the evolutionary process across three base models: DeepSeek-Chat, DeepSeek-Reasoner, and Gemini-3-Pro-Preview-Thinking. The histograms depict the Total Failures (left), decomposed into rejections by the Solvability Verification Agent (middle) and the Difficulty Verification Agent (right).
  • Figure 5: The prompt template of our Evolution Agent.
  • ...and 5 more figures
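
The ATC metric described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the use of `None` to mark a timeout, and the default token limit are all assumptions made for illustration, and the source does not state the actual limit.

```python
from typing import Optional, Sequence

# Hypothetical token limit; the source does not give the real value.
DEFAULT_MAX_TOKENS = 32768


def average_token_consumption(
    output_tokens: Sequence[Optional[int]],
    max_tokens: int = DEFAULT_MAX_TOKENS,
) -> float:
    """Average output tokens for one problem across all solver models.

    Each entry is one solver's output token count. None marks a timeout
    (an assumed convention) and is counted as max_tokens, following the
    rule stated in the Figure 3 caption.
    """
    if not output_tokens:
        raise ValueError("need at least one solver run")
    # Replace each timeout with the maximum token limit before averaging.
    counts = [max_tokens if t is None else t for t in output_tokens]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    # Two solvers finished and one timed out, with an assumed limit of 8000.
    print(average_token_consumption([1000, 3000, None], max_tokens=8000))
```

Counting a timeout as the maximum limit keeps unsolved problems in the distribution instead of dropping them, so the hardest problems still push the average up.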