Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch

Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, Arber Zela

Abstract

The autoresearch repository enables an LLM agent to search for optimal hyperparameter configurations over an unconstrained search space by editing the training code directly. Given a fixed compute budget and constraints, we use autoresearch as a testbed to compare classical hyperparameter optimization (HPO) algorithms against LLM-based methods for tuning the hyperparameters of a small language model. Within a fixed hyperparameter search space, classical HPO methods such as CMA-ES and TPE consistently outperform LLM-based agents. However, an LLM agent that directly edits training source code in an unconstrained search space narrows the gap to classical methods substantially, despite using only a self-hosted open-weight 27B model. Methods that avoid out-of-memory failures outperform those with higher search diversity, suggesting that reliability matters more than exploration breadth. While small and mid-sized LLMs struggle to track optimization state across trials, classical methods lack domain knowledge. To bridge this gap, we introduce Centaur, a hybrid that shares CMA-ES's internal state, including mean vector, step-size, and covariance matrix, with an LLM. Centaur achieves the best result in our experiments, with its 0.8B variant outperforming the 27B variant, suggesting that a cheap LLM suffices when paired with a strong classical optimizer. The 0.8B model is insufficient for unconstrained code editing but sufficient for hybrid optimization, while scaling to 27B provides no advantage for fixed search space methods. Preliminary experiments with the frontier model Gemini 3.1 Pro Preview do not close the gap to classical methods. Code is available at https://github.com/ferreirafabio/autoresearch-automl.
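The abstract describes Centaur as sharing CMA-ES's internal state (mean vector, step-size, covariance matrix) with an LLM. A minimal sketch of what such state-sharing could look like is below; the function and parameter names are illustrative assumptions, not the paper's actual interface, and the prompt shown is a stand-in for however Centaur actually conditions the LLM on optimizer state.

```python
import numpy as np

# Hypothetical sketch: serialize CMA-ES internals (mean, step-size sigma,
# covariance diagonal) into text that an LLM optimizer could condition on.
def format_cmaes_state(mean, sigma, cov, param_names):
    """Render the optimizer's internal state as lines of text for an LLM prompt."""
    lines = [f"step-size sigma: {sigma:.4f}"]
    for name, m, var in zip(param_names, mean, np.diag(cov)):
        lines.append(f"{name}: mean={m:.4f}, variance={var:.4f}")
    return "\n".join(lines)

# Illustrative hyperparameters for a small language model (assumed, not from the paper).
param_names = ["log_lr", "log_weight_decay", "warmup_frac"]
mean = np.array([-3.0, -4.5, 0.05])   # CMA-ES distribution mean
sigma = 0.3                            # global step-size
cov = np.diag([0.10, 0.25, 0.01])      # covariance matrix

prompt = (
    "You advise a CMA-ES optimizer tuning a small language model.\n"
    "Current optimizer state:\n"
    + format_cmaes_state(mean, sigma, cov, param_names)
    + "\nSuggest one candidate configuration as JSON."
)
print(prompt)
```

With a library such as pycma, the same three quantities would come from the running optimizer object rather than being constructed by hand.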

Paper Structure

This paper contains 18 sections, 11 figures, 6 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Best Validation Bits-Per-Byte (mean $\pm$ std across 3 seeds) of HPO algorithms against cumulative training time. All methods receive the same 24-hour GPU training budget; LLM inference overhead is excluded. All LLM-based methods use Qwen3.5-27B as the LLM optimizer. Classical methods such as CMA-ES and TPE converge faster and to better final values than LLM-based methods. Centaur, our CMA-ES and LLM hybrid, achieves the best result in our experiments.
  • Figure 2: 0.8B vs 27B LLM optimizer comparison (wall-time). Solid: 27B, dashed: 0.8B. TPE and Random shown as classical references. The 0.8B model appears insufficient for unconstrained code editing but sufficient for hybrid optimization.
  • Figure 3: Gemini 3.1 Pro Preview vs Qwen3.5-27B for Karpathy Agent (Code) and Centaur. Solid: Gemini (single seed, due to API cost constraints), dashed: Qwen (3 seeds). TPE shown as reference.
  • Figure 4: Convergence by trial number (mean $\pm$ std across 3 seeds). Same methods as Figure 1. Trial-number view shows sample efficiency rather than wall-clock cost.
  • Figure 5: 0.8B vs 27B by trial number (mean $\pm$ std across 3 seeds). Same methods as Figure 2.
  • ...and 6 more figures