AutoResearch-RL: Perpetual Self-Evaluating Reinforcement Learning Agents for Autonomous Neural Architecture Discovery

Nilesh Jain; Rohit Yadav; Sagar Kotian; Claude AI

TL;DR

This work formalises autonomous architecture and hyperparameter research as a Markov Decision Process, derives convergence guarantees under mild assumptions, and demonstrates empirically on a single-GPU nanochat pretraining benchmark that AutoResearch-RL discovers configurations that match or exceed hand-tuned baselines after approximately 300 overnight iterations, with no human in the loop.

Abstract

We present AutoResearch-RL, a framework in which a reinforcement learning agent conducts open-ended neural architecture and hyperparameter research without human supervision, running perpetually until a termination oracle signals convergence or resource exhaustion. At each step the agent proposes a code modification to a target training script, executes it under a fixed wall-clock time budget, observes a scalar reward derived from validation bits-per-byte (val-bpb), and updates its policy via Proximal Policy Optimisation (PPO). The key design insight is the separation of three concerns: (i) a frozen environment (data pipeline, evaluation protocol, and constants) that guarantees fair cross-experiment comparison; (ii) a mutable target file (train.py) that represents the agent's editable state; and (iii) a meta-learner (the RL agent itself) that accumulates a growing trajectory of experiment outcomes and uses them to inform subsequent proposals. We formalise this as a Markov Decision Process, derive convergence guarantees under mild assumptions, and demonstrate empirically on a single-GPU nanochat pretraining benchmark that AutoResearch-RL discovers configurations that match or exceed hand-tuned baselines after approximately 300 overnight iterations, with no human in the loop.
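
The propose, execute, observe cycle described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: a dictionary of hyperparameters stands in for edits to train.py, a random perturbation stands in for the PPO policy (no policy update is performed), the reward is assumed to be the improvement in val-bpb over the best result so far, and every name here (History, propose, research_loop, toy_evaluate) is hypothetical.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]  # hypothetical stand-in for the editable state (train.py)


@dataclass
class History:
    """Growing trajectory of (config, val_bpb, reward) records."""
    records: List[Tuple[Config, float, float]] = field(default_factory=list)

    def best(self) -> Tuple[Config, float]:
        # Lower val-bpb is better.
        cfg, bpb, _ = min(self.records, key=lambda r: r[1])
        return cfg, bpb


def propose(base: Config, rng: random.Random) -> Config:
    """Stand-in for the policy: rescale one hyperparameter at random."""
    key = rng.choice(sorted(base))
    candidate = dict(base)
    candidate[key] = base[key] * rng.uniform(0.5, 2.0)
    return candidate


def research_loop(evaluate: Callable[[Config], float], init: Config,
                  n_iters: int, seed: int = 0) -> History:
    rng = random.Random(seed)
    history = History()
    history.records.append((init, evaluate(init), 0.0))
    for _ in range(n_iters):
        base, best_bpb = history.best()
        candidate = propose(base, rng)
        val_bpb = evaluate(candidate)  # frozen environment: same evaluator every time
        reward = best_bpb - val_bpb    # assumed reward: improvement over best so far
        history.records.append((candidate, val_bpb, reward))
        # A PPO policy update would consume (candidate, reward) here; omitted.
    return history


def toy_evaluate(cfg: Config) -> float:
    """Toy val-bpb surface with its minimum at lr = 1e-3 (illustrative only)."""
    return 1.0 + (math.log10(cfg["lr"]) + 3.0) ** 2


history = research_loop(toy_evaluate, {"lr": 0.1}, n_iters=50)
best_cfg, best_bpb = history.best()
```

Because the evaluator is the same function on every iteration, the recorded val-bpb values are directly comparable, which is the role the frozen environment plays in the design.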

Paper Structure (39 sections, 3 theorems, 10 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
Contributions.
Background & Related Work
Neural Architecture Search
AutoML and Meta-Learning
LLM-Driven Code Synthesis and Agents
Self-Play and Perpetual Learning
Problem Formulation
Markov Decision Process
The Bits-Per-Byte Metric
Fixed Time Budget and Comparability
The AutoResearch-RL Agent
Policy Architecture
PPO Objective
Experiment History as Working Memory
...and 24 more sections

Key Result

Proposition 1

Under a fixed time budget $T_{\max}$ and identical hardware, the val-bpb ordering between any two configurations $c$ and $c'$ reflects a genuine capability difference, not an artefact of different iteration counts.
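
To make the proposition concrete, the sketch below shows one way a fixed wall-clock budget and a byte-normalised metric could be implemented. It assumes the standard definition of bits per byte (summed cross-entropy in nats divided by ln 2 times the number of validation bytes); this excerpt does not give the paper's exact formula. The helper names train_under_budget and bits_per_byte are hypothetical, and the clock is injectable only so the behaviour can be checked deterministically.

```python
import math
import time
from typing import Callable, Tuple


def train_under_budget(step: Callable[[], float], t_max_s: float,
                       clock: Callable[[], float] = time.monotonic
                       ) -> Tuple[int, float]:
    """Call `step` repeatedly until the wall-clock budget t_max_s is spent.

    Returns (steps completed, last loss reported by `step`). Every
    configuration gets the same time, not the same number of steps.
    """
    deadline = clock() + t_max_s
    n_steps, last_loss = 0, math.nan
    while clock() < deadline:
        last_loss = step()
        n_steps += 1
    return n_steps, last_loss


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (nats) over a validation set to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


# Usage with a fake clock that advances one unit per call (illustrative only).
ticks = iter(range(100))
steps_done, _ = train_under_budget(lambda: 0.0, t_max_s=3, clock=lambda: next(ticks))
```

A faster configuration simply completes more steps inside the same budget, so any resulting difference in val-bpb is attributable to the configuration rather than to an unequal allowance of time.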

Figures (3)

Figure 1: AutoResearch-RL system overview. The RL agent proposes code edits, the training environment executes them under a fixed time budget, the self-evaluator monitors progress and can abort early, and the resulting reward updates both the policy and the experiment history buffer. The loop runs indefinitely.
Figure 2: Best val-bpb as a function of experiment index. AutoResearch-RL discovers improvements faster and reaches a lower final value.
Figure 3: Cumulative experiments completed with and without the self-evaluation (SE) early-stop module. SE yields $\approx1.35\times$ more experiments per GPU-hour.

Theorems & Definitions (6)

Definition 1: Research MDP
Proposition 1: Metric Comparability
Theorem 2: Monotone Improvement
Proof: Proof sketch
Remark 1
Proposition 3: Sample Complexity Bound
