DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively

Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, Yue Zhang

TL;DR

DeepScientist reframes scientific discovery as goal-directed Bayesian optimization, coupling a three-stage Hypothesize–Verify–Analyze loop with a continuously growing Findings Memory to manage exploration vs. exploitation. The system autonomously iterates across candidate methods, implements promising ideas, and validates them at progressively higher fidelity, reporting SOTA progress on three frontier AI tasks within a month-long cycle. Key contributions include the hierarchical evaluation framework, a multi-agent architecture with surrogate evaluation and automated reporting, and the first large-scale demonstration that AI-driven discovery can progressively exceed human-constructed SOTA under constrained budgets. The work also highlights significant bottlenecks, notably the low success rate of ideas and the critical need for robust verification and experimental design, suggesting a future of richer human-AI collaboration to accelerate discovery responsibly.

Abstract

While previous AI Scientist systems can generate novel findings, they often lack the focus to produce scientifically valuable contributions that address pressing human-defined challenges. We introduce DeepScientist, a system designed to overcome this by conducting goal-oriented, fully autonomous scientific discovery over month-long timelines. It formalizes discovery as a Bayesian Optimization problem, operationalized through a hierarchical evaluation process consisting of "hypothesize, verify, and analyze". Leveraging a cumulative Findings Memory, this loop intelligently balances the exploration of novel hypotheses with exploitation, selectively promoting the most promising findings to higher-fidelity levels of validation. Consuming over 20,000 GPU hours, the system generated about 5,000 unique scientific ideas and experimentally validated approximately 1,100 of them, ultimately surpassing human-designed state-of-the-art (SOTA) methods on three frontier AI tasks by 183.7%, 1.9%, and 7.9%. This work provides the first large-scale evidence of an AI achieving discoveries that progressively surpass human SOTA on scientific tasks, producing valuable findings that genuinely push the frontier of scientific discovery. To facilitate further research into this process, we will open-source all experimental logs and system code at https://github.com/ResearAI/DeepScientist/.
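
The hypothesize, verify, and analyze loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Finding` record, the `run_cycle` function, the upper-confidence-bound (UCB) score used to trade off exploration against exploitation, and all numbers are hypothetical and chosen only to make the promotion logic concrete.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Finding:
    """One entry in a toy Findings Memory (hypothetical structure)."""
    idea: str
    predicted_value: float          # cheap surrogate estimate (assumed)
    uncertainty: float              # surrogate uncertainty (assumed)
    stage: str = "hypothesis"       # hypothesis -> verified -> analyzed
    measured: Optional[float] = None


def ucb(finding: Finding, kappa: float) -> float:
    """Assumed acquisition score: estimate plus an exploration bonus."""
    return finding.predicted_value + kappa * finding.uncertainty


def run_cycle(
    memory: List[Finding],
    new_ideas: List[Finding],
    evaluate: Callable[[str], float],
    baseline: float,
    budget: int,
    kappa: float = 1.0,
) -> List[Finding]:
    """Run one cycle and return the findings that beat the baseline."""
    # Hypothesize: cheap, surrogate-scored ideas enter the memory.
    memory.extend(new_ideas)

    # Rank untested ideas by the acquisition score.
    candidates = [f for f in memory if f.stage == "hypothesis"]
    candidates.sort(key=lambda f: ucb(f, kappa), reverse=True)

    # Verify: only `budget` ideas receive a costly real experiment.
    for finding in candidates[:budget]:
        finding.measured = evaluate(finding.idea)
        finding.stage = "verified"
        # Analyze: promote only results that surpass the baseline.
        if finding.measured > baseline:
            finding.stage = "analyzed"

    return [f for f in memory if f.stage == "analyzed"]


# Toy data: scores are invented for illustration only.
toy_scores = {"idea-a": 0.48, "idea-b": 0.61, "idea-c": 0.90}
memory: List[Finding] = []
ideas = [
    Finding("idea-a", predicted_value=0.50, uncertainty=0.05),
    Finding("idea-b", predicted_value=0.40, uncertainty=0.30),
    Finding("idea-c", predicted_value=0.20, uncertainty=0.10),
]
progress = run_cycle(
    memory, ideas, lambda idea: toy_scores[idea], baseline=0.55, budget=2
)
print([f.idea for f in progress])
```

In this toy run the budget of two experiments goes to `idea-b` and `idea-a`, and only `idea-b` beats the baseline and is promoted. `idea-c` stays in the memory as an untested hypothesis, which shows how a limited budget can leave a strong idea unexplored when the surrogate underrates it.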

Paper Structure

This paper contains 17 sections, 2 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 2: The autonomous, closed-loop discovery process of DeepScientist. The system iterates through a three-stage cycle, learning from both human knowledge and its own experiments.
  • Figure 3: Performance evaluation of DeepScientist across three research domains: (a-b) Agent Failure Attribution on the Who&When benchmark in handcrafted and algorithm-generated settings; (c) LLM Inference Acceleration on the MBPP dataset; (d) AI Text Detection with performance-latency tradeoff analysis. DeepScientist (shown in pink) consistently outperforms human-designed SOTA approaches (shown in blue) across all tasks.
  • Figure 4: DeepScientist's experimental statistics. (a) The research pipeline from generated ideas to validated progress. (b) Success rates comparing our selection strategy against a baseline. (c) Distribution of wall-clock execution times for all implemented trials.
  • Figure 5: Visualization of the conceptual search space for the AI text detection task. The plot shows a t-SNE visualization of the semantic embeddings for all 2,472 generated ideas. Markers identify the initial SOTA method (Initial Idea) and the three final SOTA-surpassing methods (Progress Ideas).
  • Figure 6: Scaling analysis of autonomous scientific discovery. The plot illustrates the relationship between parallel computational resources (number of GPUs) and the number of SOTA-surpassing "Progress Findings" found by DeepScientist across all tasks within a one-week period.
  • ...and 4 more figures