SelfAI: Building a Self-Training AI System with LLM Agents

Xiao Wu, Ting-Zhu Huang, Liang-Jian Deng, Xiaobing Yu, Yu Zhong, Shangqi Deng, Ufaq Khan, Jianghao Wu, Xiaofeng Liu, Imran Razzak, Xiaojun Chang, Yutong Xie

TL;DR

SelfAI addresses limitations of autonomous scientific discovery systems by introducing a general multi-agent framework that couples a User Agent, a reasoning-driven Cognitive Agent with optimal stopping, and an Experiment Manager to orchestrate large-scale, fault-tolerant experiments. It defines two novel metrics, Score and $\text{AUP}_D$, to quantify discovery efficiency and search diversity, and demonstrates strong, cross-domain performance with reduced redundant trials compared to Bayesian optimization and pure LLM baselines. Across 12 tasks in 6 domains, SelfAI consistently achieves favorable trajectories and early stopping, while maintaining interaction with human researchers to guide exploration. The work lays a practical blueprint for human-AI collaborative scientific discovery and outlines avenues for memory integration, retrieval-augmented reasoning, and autonomous tooling to further enhance cognitive autonomy in research.

Abstract

Recent work on autonomous scientific discovery has leveraged LLM-based agents to integrate problem specification, experiment planning, and execution into end-to-end systems. However, these frameworks are often confined to narrow application domains, offer limited real-time interaction with researchers, and lack principled mechanisms for determining when to halt exploration, resulting in inefficiencies, reproducibility challenges, and under-utilized human expertise. To address these gaps, we propose SelfAI, a general multi-agent platform that combines a User Agent for translating high-level research objectives into standardized experimental configurations, a Cognitive Agent powered by LLMs with optimal stopping criteria to iteratively refine hyperparameter searches, and an Experiment Manager responsible for orchestrating parallel, fault-tolerant training workflows across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. We further introduce two novel evaluation metrics, Score and $\text{AUP}_D$, to quantify discovery efficiency and search diversity. Across regression, NLP, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials compared to classical Bayesian optimization and LLM-based baselines, while enabling seamless interaction with human researchers.

Paper Structure

This paper contains 26 sections, 11 equations, 11 figures, 15 tables.

Figures (11)

  • Figure 1: SelfAI Framework for Automated Scientific Experimentation. a, Holistic architecture of the multi-agent system, which transforms various experiments in the research process into a structured workflow. b, User intentions, comprising ideas and experiment schemes, are transformed into structured configurations via a predefined prompt. These inputs are processed through successive stages: hypothesis generation, strategic planning, trial execution, and result collection. c, Performance distribution across 11 tasks demonstrates the framework's ability, when powered by GPT-4o-mini, to prioritize high-performance regions without sacrificing exploration. d, Trial counts for each solver shown in c, accompanied by quantile lines, density distributions, and performance variability across the global and two evaluation regions. Higher values in low-performance regions promote rapid escape, while lower values in high-performance regions enable localized refinement.
  • Figure 2: Scores among all solvers across different tasks to measure the best stopping criterion.
  • Figure 3: Diverse Metrics ($\text{AUP}_D$) among all solvers across different tasks to evaluate trajectory diversity.
  • Figure 4: Illustration of the optimized trajectory for the SIREN method for image segmentation. Green points are suggestions made before the optimal point was reached. Red points are redundant suggestions made after the optimal point was reached, when the search failed to stop. The $\star$ is the optimal point. The numbered labels show the order in which the LLM made its recommendations.
  • Figure 5: Illustration of the Cognitive Agent. The overall reasoning process involves several key steps: Hypothesis Generation (analysis of the current task and completed trials), Stopping Judgment, and Strategic Planning. Strategic Planning develops experimental schemes based on the analyzed hypotheses.
  • ...and 6 more figures
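
The Figure 5 caption names three steps in the Cognitive Agent's reasoning: Hypothesis Generation, Stopping Judgment, and Strategic Planning. The following is a minimal Python sketch of how such a loop could be wired together, not the paper's implementation. Everything in it is an assumption made for illustration: the `Trial` record, the function names, the single numeric hyperparameter, the nearest-neighbour planner standing in for LLM reasoning, and the patience-based rule standing in for the paper's optimal stopping criterion, which this page does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Trial:
    config: float  # hypothetical: one numeric hyperparameter
    metric: float  # higher is better


def generate_hypothesis(history: list[Trial]) -> Optional[Trial]:
    """Stand-in for LLM analysis: pick the best completed trial, if any."""
    return max(history, key=lambda t: t.metric) if history else None


def should_stop(history: list[Trial], patience: int = 3) -> bool:
    """Assumed rule: stop once the last `patience` trials fail to beat the earlier best."""
    if len(history) <= patience:
        return False
    best_before = max(t.metric for t in history[:-patience])
    best_recent = max(t.metric for t in history[-patience:])
    return best_recent <= best_before


def plan_next(best: Optional[Trial], candidates: list[float],
              tried: set[float]) -> Optional[float]:
    """Stand-in for strategic planning: try the untried candidate nearest the best one."""
    remaining = [c for c in candidates if c not in tried]
    if not remaining:
        return None
    if best is None:
        return remaining[0]
    return min(remaining, key=lambda c: abs(c - best.config))


def cognitive_loop(objective: Callable[[float], float],
                   candidates: list[float], patience: int = 3) -> list[Trial]:
    """Alternate hypothesis generation, stopping judgment, and planning."""
    history: list[Trial] = []
    while True:
        best = generate_hypothesis(history)
        if should_stop(history, patience):
            break
        nxt = plan_next(best, candidates, {t.config for t in history})
        if nxt is None:  # search space exhausted
            break
        history.append(Trial(nxt, objective(nxt)))  # trial execution
    return history


if __name__ == "__main__":
    # Toy objective with its maximum at 0.3, searched over an 11-point grid.
    trials = cognitive_loop(lambda x: -(x - 0.3) ** 2, [i / 10 for i in range(11)])
    best = max(trials, key=lambda t: t.metric)
    print(f"best config {best.config} after {len(trials)} of 11 candidate trials")
```

On this toy objective the loop halts before exhausting the grid, which is the behaviour the green and red points in Figure 4 distinguish: trials that lead to the optimum versus redundant trials run after it.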