From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents

Sirui Xia, Yikai Zhang, Aili Chen, Siye Wu, Siyu Yuan, Yanghua Xiao

Abstract

Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms that are tightly coupled with training dynamics, while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive that connects proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while supporting interpretable design principles.

Paper Structure (51 sections, 24 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Overview of POISE. Phase I (Proposal Generation) generates candidate algorithms from historical evidence and prior knowledge; Phase II (Implementation, Verification, and Evaluation) implements proposals under shared interfaces, verifies fidelity to the intended algorithm, and evaluates them with a standardized protocol; Phase III (Reflective Analysis and Archive Update) interprets the results and updates the archive.
  • Figure 2: Training dynamics in the main experiment for GRPO and representative evolved variants. The plots compare entropy, reward, and response length over training.
  • Figure 3: Accuracy-length trade-offs under the length-compression constraint.
  • Figure 4: Performance frontier vs. tree depth in the base-branch lineage; each dot is an algorithm. Solid: cumulative best Overall; dashed: mean Overall of the top 3 at each depth. Despite exploratory failures at deeper levels, the frontier improves beyond the root.
  • Figure 5: Evolutionary lineage of GRPO variants in the main run. Nodes denote algorithms, arrows denote parent-child inheritance, and node color encodes the weighted Overall (lighter is lower).
  • ...and 2 more figures