
Synthetic Sandbox for Training Machine Learning Engineering Agents

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao, Hong Yan

Abstract

As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast-executing unit tests, MLE verification requires running full ML pipelines (data preprocessing, model training, and metric evaluation) on large datasets at each rollout step, rendering trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real-world problems while constraining datasets to micro-scale (each task is paired with only 50-200 training samples). Through extensive experiments, we show that SandMLE reduces execution time by more than 13x, enabling large-scale, on-policy trajectory-wise RL for the first time in the MLE domain. On MLE-bench-lite, SandMLE yields significant gains over SFT baselines across Qwen3-8B, 14B, and 30B-A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. Furthermore, the trained policy generalizes across unseen agentic scaffolds, achieving a HumanRank score up to 32.4% higher on MLE-Dojo.
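To make the data-size insight concrete, here is a minimal sketch (ours, not the paper's pipeline): it subsamples a large seed dataset down to the 50-200-sample micro-scale described above and times one full train-and-evaluate run, i.e., the verification step that would gate each RL rollout. It assumes a tabular task and scikit-learn; the helpers make_micro_task and run_pipeline are hypothetical names.

```python
# Minimal sketch of the micro-scale idea (not SandMLE itself): shrink a seed
# task's dataset to 50-200 training samples so a full train-and-evaluate
# "verification" run finishes in seconds rather than minutes.
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for a large seed dataset (real seed tasks can exceed 100k rows).
X, y = make_classification(n_samples=200_000, n_features=20, random_state=0)

def make_micro_task(X, y, n_train=200, n_test=100, seed=0):
    """Subsample the seed data down to a micro-scale train/test split
    (hypothetical helper, not from the paper)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_train + n_test, replace=False)
    return train_test_split(X[idx], y[idx], test_size=n_test, random_state=seed)

def run_pipeline(X_tr, X_te, y_tr, y_te):
    """One full verification run: train a model and score it; this is the
    expensive step that gates each RL rollout."""
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

start = time.perf_counter()
score = run_pipeline(*make_micro_task(X, y))
print(f"micro-task accuracy={score:.3f}, wall time={time.perf_counter() - start:.2f}s")
```

On commodity hardware the micro-scale run typically completes in a fraction of a second, versus minutes for the full seed dataset, which is the gap behind the 13x speedup the abstract reports.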

Figures (8)

  • Figure 1: Standard on-policy RL for MLE tasks (left) is bottlenecked by execution latency: each rollout step requires running the full ML pipeline on the large associated datasets (>200 s). SandMLE transforms seed tasks into diverse synthetic environments with micro-scale datasets (<15 s), making trajectory-wise on-policy RL practically feasible.
  • Figure 2: The Agentic MLE Environment Factory. The procedural multi-agent workflow transforms a massive, slow-executing seed task into a coherent, high-speed, verifiable synthetic micro-task. The pipeline explicitly optimizes for data efficiency by strictly controlling the task-associated training data size (down from 196,157 samples to fewer than 200) to enable rapid iteration during policy optimization.
  • Figure 3: Distribution of the synthetic training data across three axes: application domain (left), data modality (center), and task formulation (right).
  • Figure 4: Pairwise win counts and win rates (%) across 64 synthetic tasks. Each model is evaluated against the other three; ties count as 0.5 wins (see the tallying sketch after this list).
  • Figure 5: Distribution of per-task dataset sizes across the synthetic training corpus.
  • ...and 3 more figures
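For Figure 4's tallying scheme, a minimal sketch (ours; the model names and per-task scores below are invented placeholders) of how pairwise win counts and win rates can be computed with each tie worth half a win:

```python
import itertools
import random

random.seed(0)
N_TASKS = 64
models = ["model_a", "model_b", "model_c", "model_d"]  # placeholder names
# Invented per-task scores; in the paper these would come from evaluating
# each model on the 64 synthetic tasks.
scores = {m: [random.random() for _ in range(N_TASKS)] for m in models}

def pairwise_wins(s1, s2, eps=1e-9):
    """Wins of the first model over the second across tasks; a tie adds 0.5."""
    wins = 0.0
    for a, b in zip(s1, s2):
        if abs(a - b) < eps:
            wins += 0.5
        elif a > b:
            wins += 1.0
    return wins

for m1, m2 in itertools.combinations(models, 2):
    w = pairwise_wins(scores[m1], scores[m2])
    print(f"{m1} vs {m2}: {w:.1f}/{N_TASKS} wins ({100 * w / N_TASKS:.1f}%)")
```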