From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents

Sirui Xia; Yikai Zhang; Aili Chen; Siye Wu; Siyu Yuan; Yanghua Xiao

Abstract

Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive that connects proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while supporting interpretable design principles.
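As an illustration of what a genealogically linked archive might look like, the following minimal Python sketch stores one record per candidate algorithm and links each record to its parent. It is an assumption-laden sketch, not the POISE implementation: the names `ArchiveEntry`, `Archive`, `lineage`, and `best`, along with every field name, are hypothetical and chosen only to mirror the four linked artifacts named in the abstract (proposal, implementation, evaluation, reflection).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ArchiveEntry:
    """One candidate algorithm plus its evidence (hypothetical schema)."""
    entry_id: str
    parent_id: Optional[str]   # None for the root, e.g. the starting baseline
    proposal: str              # natural-language description of the mechanism
    implementation: str        # reference to the executable implementation
    scores: Dict[str, float] = field(default_factory=dict)  # evaluation results
    reflection: str = ""       # natural-language analysis of the outcome


class Archive:
    """Stores entries by id and exposes parent-child lineage (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, ArchiveEntry] = {}

    def add(self, entry: ArchiveEntry) -> None:
        # Reject entries whose parent has not been archived yet.
        if entry.parent_id is not None and entry.parent_id not in self._entries:
            raise KeyError(f"unknown parent: {entry.parent_id}")
        self._entries[entry.entry_id] = entry

    def lineage(self, entry_id: str) -> List[ArchiveEntry]:
        """Return the chain of entries from the root down to entry_id."""
        chain: List[ArchiveEntry] = []
        current: Optional[str] = entry_id
        while current is not None:
            entry = self._entries[current]
            chain.append(entry)
            current = entry.parent_id
        return list(reversed(chain))

    def best(self, metric: str) -> ArchiveEntry:
        """Return the entry with the highest value of the given metric."""
        scored = [e for e in self._entries.values() if metric in e.scores]
        return max(scored, key=lambda e: e.scores[metric])
```

Under this sketch, a proposal step could read `lineage(...)` to gather the proposals, scores, and reflections of a candidate's ancestors before generating a child. How POISE actually prioritizes lineages and builds context is described in the Method section, not here.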

Paper Structure (51 sections, 24 equations, 7 figures, 4 tables)

Introduction
Related Work
LLM-driven Scientific Discovery
Policy Optimization for Large Language Models
Method
Problem Modeling and Methodology
Phase I: Proposal Generation via Epistemic Evolutionary Search
Lineage Prioritization.
Context Construction from the Archive and Literature.
Population Evolution & Selection.
Phase II: Implementation, Verification, and Evaluation
Implementation and Verification under Shared Interfaces.
Standardized Evaluation Protocol.
Phase III: Reflective Analysis & Archive Update
Reflective Analysis.
...and 36 more sections

Figures (7)

Figure 1: Overview of POISE. Phase I (Proposal Generation) generates candidate algorithms from historical evidence and prior knowledge; Phase II (Implementation, Verification, and Evaluation) implements proposals under shared interfaces, verifies fidelity to the intended algorithm, and evaluates them with a standardized protocol; Phase III (Reflective Analysis and Archive Update) interprets the results and updates the archive.
Figure 2: Training dynamics in the main experiment for GRPO and representative evolved variants. The plots compare entropy, reward, and response length over training.
Figure 3: Accuracy-length trade-offs under the length-compression constraint.
Figure 4: Performance frontier vs. tree depth in the base-branch lineage; dots are algorithms. Solid: cumulative best Overall; dashed: mean Overall of top-3 at each depth. Despite exploratory failures at depth, the frontier improves beyond the root.
Figure 5: Evolutionary lineage of GRPO variants in the main run. Nodes denote algorithms, arrows denote parent-child inheritance, and node color encodes the weighted Overall (lighter is lower).
...and 2 more figures
