AI Scientist via Synthetic Task Scaling

Ziyang Cai, Harkirat Behl

Abstract

With the advent of AI agents, automatic scientific discovery has become a tenable goal. Many recent works scaffold agentic systems that can perform machine learning research, but they don't offer a principled way to train such agents -- and current LLMs often generate plausible-looking but ineffective ideas. To make progress on training agents that can learn from doing, we provide a novel synthetic environment generation pipeline targeting machine learning agents. Our pipeline automatically synthesizes machine learning challenges compatible with the SWE-agent framework, covering topic sampling, dataset proposal, and code generation. The resulting synthetic tasks are 1) grounded in real machine learning datasets, because the proposed datasets are verified against the Hugging Face API, and 2) verified for higher quality with a self-debugging loop. To validate the effectiveness of our synthetic tasks, we tackle MLGym, a benchmark for machine learning tasks. From the synthetic tasks, we sample trajectories from a teacher model (GPT-5), then use the trajectories to train student models (Qwen3-4B and Qwen3-8B). The student models trained with our synthetic tasks achieve improved performance on MLGym, raising the AUP metric by 9% for Qwen3-4B and 12% for Qwen3-8B.
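
The stages named in the abstract (topic sampling, dataset proposal, dataset verification, code generation, self-debugging) could be wired together roughly as follows. This is a minimal sketch, not the authors' implementation: every name here (`Task`, `generate_task`, the injected callables, and `max_debug_rounds`) is hypothetical, and the LLM calls and the dataset lookup are passed in as plain functions. In practice the `dataset_exists` check might be backed by a Hugging Face Hub lookup, but that wiring is an assumption and is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    """A synthesized task (hypothetical container, not from the paper)."""
    topic: str
    dataset: str
    code: str


def generate_task(
    sample_topic: Callable[[], str],
    propose_dataset: Callable[[str], str],
    dataset_exists: Callable[[str], bool],
    generate_code: Callable[[str, str], str],
    run_code: Callable[[str], Optional[str]],  # returns an error message, or None on success
    fix_code: Callable[[str, str], str],
    max_debug_rounds: int = 3,  # assumed budget; the abstract does not state one
) -> Optional[Task]:
    """Run one pass of the pipeline; return None if the task must be discarded."""
    topic = sample_topic()
    dataset = propose_dataset(topic)

    # Grounding step: keep only tasks whose proposed dataset really exists.
    if not dataset_exists(dataset):
        return None

    code = generate_code(topic, dataset)
    error = run_code(code)

    # Self-debugging loop: repair the code instead of discarding it right away.
    rounds = 0
    while error is not None and rounds < max_debug_rounds:
        code = fix_code(code, error)
        error = run_code(code)
        rounds += 1

    return Task(topic, dataset, code) if error is None else None
```

Passing the stages in as callables keeps the control flow testable without any model or network access; a task is dropped only when the dataset cannot be verified or the debug budget is exhausted.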

Paper Structure (40 sections, 6 figures)

Figures (6)

  • Figure 1: Illustration of our task and trajectory generation workflow. Crucially, the task generation process does not require human supervision. Instead, it automatically samples machine learning topics and proposes datasets to use in the task. To resolve compilation issues in generated tasks, we further enhance the generation with a debug loop instead of immediately discarding the task altogether.
  • Figure 2: Generated trajectory count for each task. We select 20 generated tasks and show the number of successful trajectories for each task. Because of the unsupervised nature of our pipeline, we don't expect all tasks to successfully create all 256 trajectories.
  • Figure 3: Top left: summary statistics of the final training trajectories. Top right: statistics of truncated trajectories. Bottom left: distribution of tasks by token length. Bottom right: distribution of the number of turns per trajectory.
  • Figure 4: Model performance comparison between the baselines (GPT-4o, GPT-5, Qwen3-4B, and Qwen3-8B) and our trained models (SFT-Qwen3-4B and SFT-Qwen3-8B). Performance is aggregated across 64 runs and displayed as violin plots for each subtask of MLGym. If all runs on a subtask fail, the chart shows an empty bar. In 9 out of 13 tasks, our trained models perform better than the baseline Qwen3-4B models.
  • Figure 5: Aggregate performance on MLGym. Since different sub-tasks in MLGym have different score scales and comparison directions, Nathani et al. (2025) introduced the AUP score, which stands for area under the performance curve. Here we report the AUP score of each model.
  • ...and 1 more figure
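
The Figure 5 caption explains why an aggregate score is needed but does not restate how AUP is computed. The sketch below shows one way an "area under the performance curve" score can merge tasks with different scales and directions. It assumes a Dolan-Moré style performance profile, strictly positive scores, and a hypothetical cutoff `tau_max`; the exact MLGym definition may differ. The function names are illustrative only.

```python
import math
from typing import Dict, Optional


def performance_ratios(
    scores: Dict[str, Optional[float]],
    best: Dict[str, float],
    higher_is_better: Dict[str, bool],
) -> Dict[str, float]:
    """Map each task score to a ratio >= 1, where 1 means matching the best score.

    Assumes all scores are strictly positive. A missing score (failed run)
    gets an infinite ratio.
    """
    ratios = {}
    for task, best_score in best.items():
        score = scores.get(task)
        if score is None:
            ratios[task] = math.inf
        elif higher_is_better[task]:
            ratios[task] = best_score / score
        else:
            ratios[task] = score / best_score
    return ratios


def aup(ratios: Dict[str, float], tau_max: float) -> float:
    """Area under the performance profile on [1, tau_max].

    The profile rho(tau) is the fraction of tasks whose ratio is <= tau.
    Because rho is a step function, each task contributes exactly
    max(0, tau_max - ratio) to the integral, so no numerical grid is needed.
    """
    total = sum(max(0.0, tau_max - max(r, 1.0)) for r in ratios.values())
    return total / len(ratios)


# Toy example: task "a" is maximized, task "b" is minimized.
example = performance_ratios(
    scores={"a": 0.5, "b": 2.0},
    best={"a": 1.0, "b": 1.0},
    higher_is_better={"a": True, "b": False},
)
example_aup = aup(example, tau_max=4.0)
```

Under these assumptions, a model that matches the best score on every task reaches the maximum value `tau_max - 1`, and a model that fails every task scores 0.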