MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

Rushi Qiang, Yuchen Zhuang, Anikait Singh, Percy Liang, Chao Zhang, Sherry Yang, Bo Dai

TL;DR

The paper tackles the bottleneck in obtaining scalable, high-quality MLE training data by introducing MLE-Smith, a fully automated generate–verify–execute multi-agent pipeline. It enforces structural integrity, semantic soundness, and empirical solvability through a three-agent workflow (Brainstormer, Designer, Refactor) and a hybrid verification stack (Assertions, Reviews, Execution). Applied to 224 real-world datasets, it yields 606 verified tasks across diverse modalities. The performance of eight LLMs on these tasks correlates with their performance on human-designed benchmarks, indicating realism and discriminative power. This approach enables scalable, end-to-end MLE benchmarks that can drive the evaluation and training of next-generation MLE agents.

Abstract

While Language Models (LMs) have made significant progress in automating machine learning engineering (MLE), the acquisition of high-quality MLE training data is significantly constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks that demand extensive time and manual effort to produce. We introduce MLE-Smith, a fully automated multi-agent pipeline that transforms raw datasets into competition-style MLE challenges through an efficient generate-verify-execute paradigm for scaling MLE tasks with verifiable quality, real-world usability, and rich diversity. The proposed multi-agent pipeline in MLE-Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness. It further validates empirical solvability and real-world fidelity through interactive execution. We apply MLE-Smith to 224 real-world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE-Smith can work effectively across a wide range of real-world datasets. Evaluation on the generated tasks shows that the performance of eight mainstream and cutting-edge LLMs on MLE-Smith tasks is strongly correlated with their performance on carefully human-designed tasks, highlighting the effectiveness of MLE-Smith in scaling up MLE tasks while maintaining task quality.

Paper Structure

This paper contains 68 sections, 12 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: MLE-Smith automatically generates competition-style machine learning engineering (MLE) tasks from raw datasets through a generate–verify–execute paradigm.
  • Figure 2: Domain, Modality, and Formulation Distribution of MLE-Smith generated tasks. From left to right, the panels show the distributions of modality, objective, domain, and metric, respectively. The "Others" category aggregates all types whose individual proportions are relatively minor.
  • Figure 3: Pairwise win–loss matrices of eight models on the Dojo, Smith, and Combined sets. Each cell $(i,j)$ records the number of tasks on which model $i$ outperforms model $j$, and the aggregated score is computed by awarding $1$ point for a win, $0.5$ points for a tie, and $0$ points for a loss.
  • Figure 4: Step-wise Performance Dynamics of normalized raw scores. Curves are obtained by pointwise averaging over tasks in corresponding categories. Information-requesting steps are excluded.
  • Figure 5: Unified directory structure that Refactor should deliver.
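
The scoring rule in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `pairwise_win_loss`, the input layout (one list of per-task scores per model, aligned by task, higher is better), and the choice to sum points over all opponents are assumptions made for illustration. The model names and scores are made-up toy values.

```python
from typing import Dict, List, Tuple


def pairwise_win_loss(
    scores: Dict[str, List[float]],
) -> Tuple[Dict[str, Dict[str, int]], Dict[str, float]]:
    """Build a pairwise win matrix and an aggregated score per model.

    Assumption: scores[m][t] is model m's score on task t, lists are
    aligned by task, and a higher score is better.
    """
    models = list(scores)
    # wins[i][j] = number of tasks on which model i outperforms model j
    wins = {i: {j: 0 for j in models if j != i} for i in models}
    # points[i] = 1 per win, 0.5 per tie, 0 per loss, summed over opponents
    points = {m: 0.0 for m in models}

    for i in models:
        for j in models:
            if i == j:
                continue
            for s_i, s_j in zip(scores[i], scores[j]):
                if s_i > s_j:
                    wins[i][j] += 1
                    points[i] += 1.0
                elif s_i == s_j:
                    points[i] += 0.5
    return wins, points


# Toy example with three hypothetical models and three tasks.
toy_scores = {
    "model_a": [0.9, 0.5, 0.7],
    "model_b": [0.8, 0.5, 0.9],
    "model_c": [0.1, 0.2, 0.3],
}
wins, points = pairwise_win_loss(toy_scores)
print(wins["model_a"]["model_b"])  # wins of model_a over model_b
print(points)
```

In the toy data, `model_a` and `model_b` each beat the other once and tie once, so the tie contributes 0.5 points to both sides and neither gains a win in the matrix for that task.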