Auto Researching, not hyperparameter tuning: Convergence Analysis of 10,000 Experiments

Xiaoyi Li

Abstract

When LLM agents autonomously design ML experiments, do they perform genuine architecture search -- or do they default to hyperparameter tuning within a narrow region of the design space? We answer this question by analyzing 10,469 experiments executed by two LLM agents (Claude Opus and Gemini 2.5 Pro) across a combinatorial configuration space of 108,000 discrete cells for dashcam collision detection over 27 days. Through ANOVA decomposition, we find that \textbf{architectural choices explain 94\% of performance variance} ($F = 1324$, $\eta^2 = 0.94$), while hyperparameter variation within a fixed architecture explains only 6\%. Cross-task validation on a second collision dataset corroborates this finding (75\% architecture-explained variance) with a \emph{different} winning backbone, confirming genuine architecture discovery. The agents' key contribution is twofold: discovering that V-JEPA\,2 video features with Zipformer temporal encoders achieve 0.9245 AP -- a configuration no human proposed -- and concentrating search on productive architectural regions. At $N = 50$, LLM-guided search reaches AP $= 0.985$ versus $0.965$ for from-scratch random search. Post-bugfix convergence follows a power law ($c = 0.11$, $R^2 = 0.93$); the low exponent reflects the cost of broad exploration, not inefficiency, since the LLM discovers qualitatively better regions than random or Bayesian baselines. We characterize multi-agent search dynamics via entropy cycles and Jensen--Shannon specialization, providing the first large-scale empirical framework for LLM-guided combinatorial ML experiment design.
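The variance split quoted in the abstract can be illustrated with a one-way ANOVA, in which $\eta^2$ is the between-group sum of squares divided by the total sum of squares. The following is a minimal sketch under that assumption, not the authors' analysis code: the function name `anova_one_way`, the flat label encoding, and the toy numbers are all hypothetical, and the paper's actual decomposition may use more factors.

```python
from collections import defaultdict


def anova_one_way(groups, scores):
    """Return (eta_squared, f_statistic) for a one-way ANOVA.

    groups: one architecture label per experiment (hypothetical encoding).
    scores: the matching AP values.
    """
    if len(groups) != len(scores) or len(scores) < 2:
        raise ValueError("need equally long inputs with at least two scores")
    by_group = defaultdict(list)
    for label, score in zip(groups, scores):
        by_group[label].append(score)

    n, k = len(scores), len(by_group)
    grand_mean = sum(scores) / n
    # Between-group sum of squares: variance attributable to the label.
    ss_between = sum(
        len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in by_group.values()
    )
    # Within-group sum of squares: variance left over inside each label.
    ss_within = sum(
        (s - sum(v) / len(v)) ** 2 for v in by_group.values() for s in v
    )
    ss_total = ss_between + ss_within
    eta_sq = ss_between / ss_total if ss_total > 0 else 0.0
    if k < 2 or n <= k or ss_within == 0:
        f_stat = float("nan")  # F is undefined in these degenerate cases
    else:
        f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return eta_sq, f_stat


# Toy data, not from the paper: two architectures, two runs each.
labels = ["arch_a", "arch_a", "arch_b", "arch_b"]
ap = [0.90, 0.92, 0.60, 0.62]
eta_sq, f_stat = anova_one_way(labels, ap)
print(f"eta^2 = {eta_sq:.3f}, F = {f_stat:.1f}")
```

With a single grouping factor, the between-group and within-group shares sum to one, which matches the 94\% / 6\% split reported above.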

Paper Structure (74 sections, 6 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: System overview. Two LLM agents observe the shared leaderboard and propose configurations $c_t \in \mathcal{C}$. The orchestrator deduplicates proposals, schedules execution on a GPU cluster, and updates the history $H_t$. Self-healing handles runtime failures via LLM-assisted diagnosis.
  • Figure 2: Convergence of cumulative best AP. Left: full campaign (10,469 experiments) showing discrete jumps at bug-fix events (vertical dashed lines). Right: post-bugfix subset (3,003 experiments) with a cleaner power-law fit ($R^2 = 0.93$). Both $\pi_{\mathrm{rand}}$ and $\pi_{\mathrm{TPE}}$ operate on the LLM-curated pool; their faster convergence reflects sampling from an already-curated set, not superior configuration generation (see text).
  • Figure 3: Multi-agent dynamics. (a) Configuration-space entropy exhibits exploration-exploitation cycles rather than monotonic decay, with architectural entropy changing faster than training entropy. (b) Backbone distribution shift over the campaign, showing convergence from diverse backbones toward VJepa2 dominance.
  • Figure 4: AP trajectory over the 27-day campaign, showing discrete jumps at bug fixes.
  • Figure 5: Mean AP heatmap for backbone $\times$ encoder combinations. VJepa2 dominates across all encoder types, with Zipformer and Hybrid R-M performing best within VJepa2. DINOv3-B performs poorly across all encoders, suggesting the base-size model lacks capacity for this task.
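The power-law fit mentioned in the Figure 2 caption can be approximated by a log-log least-squares regression on the cumulative-best curve. The sketch below assumes the fitted quantity is the gap between a reference ceiling and the cumulative best AP, modeled as $a \cdot t^{-c}$. That functional form, the ceiling of 1.0, the function names, and the toy history are assumptions for illustration; the paper's exact parameterization is not given in this listing.

```python
import math


def cumulative_best(values):
    """Running maximum of a sequence of AP values."""
    best, out = float("-inf"), []
    for v in values:
        best = max(best, v)
        out.append(best)
    return out


def fit_power_law(gaps):
    """Fit gap_t ~ a * t**(-c) by least squares in log-log space.

    gaps[i] is the gap after experiment i + 1; non-positive gaps are skipped.
    Returns (a, c, r_squared).
    """
    pts = [(math.log(t), math.log(g)) for t, g in enumerate(gaps, start=1) if g > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two positive gaps")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in pts)
    ss_tot = sum((y - mean_y) ** 2 for _, y in pts)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(intercept), -slope, r_squared


# Toy history, not from the paper. The ceiling of 1.0 is an assumed reference.
ap_history = [0.50, 0.48, 0.61, 0.60, 0.66, 0.65, 0.70, 0.71]
gaps = [1.0 - b for b in cumulative_best(ap_history)]
a, c, r2 = fit_power_law(gaps)
print(f"a = {a:.3f}, c = {c:.3f}, R^2 = {r2:.3f}")
```

Under this form, a small exponent such as the reported $c = 0.11$ means the gap shrinks slowly with each additional experiment.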

Theorems & Definitions (5)

  • Definition 1: Configuration Space
  • Remark 1: Information Advantage of LLM Search
  • Definition 2: Configuration-Space Entropy
  • Definition 3: Agent Specialization
  • Definition 4: Innovation Rate
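Definitions 2 and 3 are listed here by title only, so the sketch below is an illustration rather than the paper's formulation. It assumes configuration-space entropy is the Shannon entropy of the empirical distribution over one categorical field (for example, the backbone), and that agent specialization is the Jensen--Shannon divergence between two agents' distributions over that field. The function names and the toy proposals are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(choices):
    """Shannon entropy (bits) of the empirical distribution of `choices`."""
    counts = Counter(choices)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("choices must be non-empty")
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def js_divergence(choices_a, choices_b):
    """Jensen-Shannon divergence (bits) between two empirical distributions.

    Ranges from 0 (identical distributions) to 1 (disjoint support).
    """
    if not choices_a or not choices_b:
        raise ValueError("both inputs must be non-empty")
    count_a, count_b = Counter(choices_a), Counter(choices_b)
    n_a, n_b = len(choices_a), len(choices_b)
    total = 0.0
    for key in set(count_a) | set(count_b):
        p = count_a[key] / n_a
        q = count_b[key] / n_b
        m = 0.5 * (p + q)  # mixture distribution; positive for every key
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total


# Toy proposals, not from the paper: backbone chosen by each agent.
agent_1 = ["VJepa2", "VJepa2", "VJepa2", "DINOv3-B"]
agent_2 = ["VJepa2", "DINOv3-B", "DINOv3-B", "DINOv3-B"]
print(f"entropy(agent_1) = {shannon_entropy(agent_1):.3f} bits")
print(f"JS(agent_1, agent_2) = {js_divergence(agent_1, agent_2):.3f} bits")
```

Computing the entropy over a sliding window of recent proposals would expose the exploration-exploitation cycles described in the Figure 3 caption; the window length is a further choice this sketch does not fix.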