Towards Automated Machine Learning Research

Shervin Ardeshir

TL;DR

The paper proposes a top-down, LLM-driven framework for automated ML research that generates and evaluates novel ML components through a Generator–Validator–Evaluator loop, augmented by a Reward model that ranks hypotheses by two win-rate metrics: Baseline Win Rate (B-WR) and Baseline State-of-the-Art Win Rate (BSOTA-WR). It treats hypothesis generation as a cross-domain exploration problem, which allows it to propose components outside any predefined search space (e.g., activation functions, preprocessors, regularizers) and to test them in a lightweight, one-pass setting before committing substantial compute. An empirical study generates 36,000 hypotheses across three component types using three LLMs and two prompting strategies. It shows that most hypotheses underperform, that a subset outperforms the baselines, and that reward models generalize across LLM sources. The work highlights risks such as reward collapse and inductive bias toward SOTA methods, and outlines future directions: broadening component coverage, refining the reward mechanism, and moving toward differentiable integration and broader credit assignment.

Abstract

This paper explores a top-down approach to automating incremental advances in machine learning research through component-level innovation, facilitated by Large Language Models (LLMs). Our framework systematically generates novel components, validates their feasibility, and evaluates their performance against existing baselines. A key distinction of this approach lies in how these novel components are generated. Unlike traditional AutoML and NAS methods, which often rely on a bottom-up combinatorial search over predefined, hard-coded base components, our method leverages the cross-domain knowledge embedded in LLMs to propose new components that are not confined to any hard-coded, predefined set. By incorporating a reward model to prioritize promising hypotheses, we aim to improve the efficiency of the hypothesis generation and evaluation process. We hope this approach offers a new avenue for exploration and contributes to the ongoing dialogue in the field.
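
The generate, validate, evaluate, and prioritize steps described above can be illustrated with a small, self-contained Python sketch. It is not the paper's implementation: the LLM generator is replaced by a fixed list of candidate activation expressions, the reward model by an arbitrary placeholder callable, and the lightweight evaluation by a toy curve-fitting proxy. All function names (`generate_hypotheses`, `validate`, `evaluate_light`, `run_pipeline`), the probe inputs, and the choice to apply the reward ranking before the lightweight evaluation are assumptions made for illustration.

```python
import math
from typing import Callable, List, Optional, Tuple

Activation = Callable[[float], float]
PROBES = [-2.0, -0.5, 0.0, 0.5, 2.0]  # assumed probe inputs for the validity check


def generate_hypotheses() -> List[str]:
    """Stand-in for the LLM generator: candidate activation expressions in x."""
    return [
        "max(0.0, x)",               # ReLU-like
        "x / (1.0 + math.exp(-x))",  # swish-like
        "math.log(x)",               # fails at runtime for x <= 0
        "(x * 2",                    # does not parse
        "math.tanh(x) * x",
    ]


def validate(expr: str) -> Optional[Activation]:
    """Return a callable if expr parses and is finite on PROBES, else None."""
    try:
        code = compile(expr, "<hypothesis>", "eval")
    except SyntaxError:
        return None
    env = {"__builtins__": {}, "math": math, "max": max, "min": min, "abs": abs}

    def fn(x: float) -> float:
        return eval(code, env, {"x": x})

    try:
        ok = all(math.isfinite(float(fn(x))) for x in PROBES)
    except Exception:
        return None
    return fn if ok else None


def evaluate_light(fn: Activation) -> float:
    """Toy one-pass proxy: negative MSE of the best scaled fit a*fn(x) to |x|."""
    xs = [i / 10.0 for i in range(-20, 21)]
    fs = [fn(x) for x in xs]
    ts = [abs(x) for x in xs]
    denom = sum(f * f for f in fs)
    a = sum(f * t for f, t in zip(fs, ts)) / denom if denom > 0 else 0.0
    return -sum((a * f - t) ** 2 for f, t in zip(fs, ts)) / len(xs)


def run_pipeline(reward: Callable[[str], float], budget: int) -> List[Tuple[str, float]]:
    """Validate all candidates, rank by reward, evaluate only the top `budget`."""
    valid = []
    for expr in generate_hypotheses():
        fn = validate(expr)
        if fn is not None:
            valid.append((expr, fn))
    valid.sort(key=lambda pair: reward(pair[0]), reverse=True)
    scored = [(expr, evaluate_light(fn)) for expr, fn in valid[:budget]]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder reward: prefers shorter expressions. A learned model would go here.
    for expr, score in run_pipeline(reward=lambda e: -len(e), budget=2):
        print(f"{score:.4f}  {expr}")
```

In this sketch the `budget` argument plays the role of the compute limit: only the candidates the reward callable ranks highest reach the evaluator, which is where a learned reward model could save evaluation cost.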

Paper Structure

This paper contains 52 sections, 8 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Framework overview: A set of hypotheses is generated to modify a specific component of an existing (hard-coded) framework. These hypotheses are tested for validity, evaluated for performance, and ranked, with the most promising candidates undergoing a full, computationally intensive evaluation.
  • Figure 2: Activation functions generated through Incrementality Encouraging (left) and Novelty Encouraging (right) prompting.
  • Figure 3: Scatter Plots of Win Rate Probabilities: These scatter plots illustrate the relationship between the Baseline Win Rate (B-WR) and the Baseline State-of-the-Art Win Rate (BSOTA-WR) for several hypotheses across different blocks. The x-axis represents the B-WR and the y-axis the BSOTA-WR. By definition, a hypothesis reaching 1 on one axis will reach 1 on the other. As expected, the distributions also show that BSOTA-WR is generally a more difficult objective to achieve.
  • Figure 4: Efficiency of the activation function Reward Models. Each graph illustrates how efficiently the respective reward model prioritizes high-performing hypotheses when trained and tested on hypothesis datasets generated by different LLMs.
  • Figure 5: Left Panel: Graphical representation of the top-12 proposed activation functions in one of the runs. Each graph shows the shape of the activation function along with its Baseline State-of-the-Art Win Rate (BSOTA-WR) and novelty score (N-score). The highlighted function (in red) demonstrates a notable balance between performance (BSOTA-WR: 0.931) and novelty (N-score: 0.034). Right Panel: Self-similarity heatmap of the top-12 activation functions. The color scale represents the degree of similarity, with yellow indicating high similarity and blue indicating lower similarity. This matrix helps to identify clusters of similar functions, highlighting the uniqueness of each proposed activation function.
  • ...and 7 more figures