An AI system to help scientists write expert-level empirical software

Eser Aygün, Anastasiya Belyaeva, Gheorghe Comanici, Marc Coram, Hao Cui, Jake Garrison, Renee Johnston, Anton Kast, Cory Y. McLean, Peter Norgaard, Zahra Shamsi, David Smalling, James Thompson, Subhashini Venugopalan, Brian P. Williams, Chujun He, Sarah Martinson, Martyna Plomecka, Lai Wei, Yuchen Zhou, Qian-Ze Zhu, Matthew Abraham, Erica Brand, Anna Bulanova, Jeffrey A. Cardille, Chris Co, Scott Ellsworth, Grace Joseph, Malcolm Kane, Ryan Krueger, Johan Kartiwa, Dan Liebling, Jan-Matthis Lueckmann, Paul Raccuglia, Xuefei Wang, Katherine Chou, James Manyika, Yossi Matias, John C. Platt, Lizzie Dorfman, Shibl Mourad, Michael P. Brenner

TL;DR

The paper presents an AI system that combines a large language model with tree search to automatically generate, mutate, and evaluate expert-level empirical software aimed at solving scorable scientific tasks. By rewriting code and exploring a vast solution space, the approach achieves expert-level performance across diverse domains, including scRNA-seq batch integration, COVID-19 forecasting, geospatial segmentation, and neural activity prediction, often outperforming established human-developed methods. Key contributions include demonstrated gains via recombination of existing methods and the integration of external research ideas (via Gemini embeddings, Deep Research, and AI co-scientists), effectively accelerating scientific discovery. The work positions automated empirical software generation as a viable path to rapidly advancing scientific progress, reducing exploration time from weeks or months to hours or days. These results have broad implications for domains where task performance can be machine-scored and suggest a generalizable framework for AI-driven scientific software synthesis.

Abstract

The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments. To address this, we present an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS) to systematically improve the quality metric and intelligently navigate the large space of possible solutions. The system achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a wide range of benchmarks. In bioinformatics, it discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, it generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. Our method also produced state-of-the-art software for geospatial analysis, neural activity prediction in zebrafish, time series forecasting and numerical solution of integrals. By devising and implementing novel solutions to diverse tasks, the system represents a significant step towards accelerating scientific progress.
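
The loop described in the abstract, an LLM proposing code rewrites inside a tree search that maximizes a quality metric, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not the paper's implementation: the `Node` class, the UCB1-style selection rule, the constant `c`, and the `rewrite` and `evaluate` callables, which stand in for the LLM call and the sandboxed scorer respectively.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One candidate program in the search tree (hypothetical structure)."""
    code: str
    score: float
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 1


def select(nodes: List[Node], c: float) -> Node:
    """Pick the node to expand with a UCB1-style rule (an assumed rule).

    The first term favors high-scoring nodes (exploitation); the second
    favors rarely visited nodes (exploration).
    """
    total = sum(n.visits for n in nodes)
    return max(
        nodes,
        key=lambda n: n.score + c * math.sqrt(math.log(total) / n.visits),
    )


def tree_search(
    root_code: str,
    rewrite: Callable[[str, float], str],  # stand-in for the LLM rewrite step
    evaluate: Callable[[str], float],      # stand-in for sandboxed scoring
    budget: int,
    c: float = 0.5,
) -> Node:
    """Expand `budget` nodes and return the best-scoring node found."""
    root = Node(root_code, evaluate(root_code))
    nodes = [root]
    for _ in range(budget):
        parent = select(nodes, c)
        child_code = rewrite(parent.code, parent.score)
        child = Node(child_code, evaluate(child_code), parent=parent)
        parent.children.append(child)
        nodes.append(child)
        # Propagate the visit up the ancestor chain.
        ancestor: Optional[Node] = parent
        while ancestor is not None:
            ancestor.visits += 1
            ancestor = ancestor.parent
    return max(nodes, key=lambda n: n.score)


# Toy usage: the "program" is a number stored as text, and the score peaks at 3.
def toy_rewrite(code: str, score: float) -> str:
    return str(float(code) + random.uniform(-1.0, 1.0))


def toy_evaluate(code: str) -> float:
    return -(float(code) - 3.0) ** 2


random.seed(0)
best = tree_search("0.0", toy_rewrite, toy_evaluate, budget=50)
```

In a real setting the score would need to be on a scale comparable to the exploration term, and `rewrite` would receive the task description and research ideas in addition to the parent code.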

Paper Structure

This paper contains 7 sections, 1 equation, 26 figures, 16 tables, 1 algorithm.

Figures (26)

  • Figure 1: Schematic and performance of our method. a, Schematic of our method. A scorable task, together with research ideas proposing methods to solve the task, is fed to an LLM, which produces code to evaluate the scorable task in a sandbox. This is then embedded within a tree search algorithm, whereby new nodes are chosen by balancing exploitation and exploration, sampling from the LLM (Methods). b, Performance of code generation methods on the Kaggle Playground benchmark. Results report the average public leaderboard percentile performance over 16 tasks. Methods based on our method are listed in bold. Error bars indicate standard deviation. BDT, boosted decision tree. c, Mechanisms used to create initial research ideas to solve scientific problems.
  • Figure 1: Experimental design for single-cell batch integration. a, We sourced our tree search development dataset from CELLxGENE. After filtering and manually selecting the dataset 364bd0c7-f7fd-48ed-99c1-ae26872b1042 version ffdaa1f0-b1d1-4135-8774-9fed7bf039ba (see Methods), which has a similar profile to the six datasets used in the OpenProblems.bio Batch Integration benchmark (distinct datasets also in CELLxGENE), we sampled 20,000 cells for the training split and 20,000 for the validation split. b, For each of the 11 base methods, we generated a detailed method description and inserted it into a prompt to initialize the tree search. We ran three independent tree search replicas per method, using the training split for hill climbing. From each tree, we selected the top-performing node based on its training score. We then evaluated each top node's script on the validation split and selected the best one based on validation performance. The best implementation per method was finally evaluated on the OpenProblems.bio holdout datasets, and the corresponding scores are reported as final results.
  • Figure 2: Performance of tree search on scRNA-seq batch integration. a, Schematic of the batch integration task, in which disparate datasets (teal and red) are processed to remove batch effects in the data while retaining biological variability. b, Performance of tree search (method names bolded and suffixed by "(TS)") compared to the analogous published method on the OpenProblems benchmark v2.0.0 [luecken2025]. "Perfect embedding by celltype with jitter" is a positive control method that represents the best possible performance and "Shuffle integration by batch" is a negative control that does not perform any batch integration. Overall score is the mean over all datasets and metrics. Each Datasets column shows the mean of all metrics computed over that dataset. Each Metrics column shows the mean of that metric computed over all datasets. Metrics were assigned a value of 0 if they could not be computed or if their performance was worse than the lowest negative control; these are displayed as empty. c, Performance improvements annotated with code innovation for the top-performing batch balanced $k$-nearest neighbors (BBKNN) implementation. ComBat-based embedding generation was introduced in implementation attempt 429. d, Overall score for OpenProblems benchmark v2.0.0 [luecken2025] non-control methods, our method with and without recombination of ideas, Gemini Deep Research [GeminiDeepResearch], and our method with AI co-scientist [gottweis2025towards]. Y-axis lower bound is the overall score of the "Shuffle integration by batch" negative control method. Seven recombination, five base methods, and two AI co-scientist methods that do not match its performance are omitted. * indicates the method is a recombination, even if not explicitly prompted for recombination. TS, tree search; fastMNN, batchelor fastMNN; mnnCorrect, batchelor mnnCorrect.
  • Figure 2: Uniform Manifold Approximation and Projection (UMAP) [mcinnes2018] of BBKNN (TS) on the Immune Cell Atlas dataset. a, The UMAP projection colored by cell type shows cell-type-specific clusters. b, The UMAP projection colored by data batch shows good batch mixing across the dataset.
  • Figure 3: Performance of tree search on COVID-19 forecasting. a, Rolling validation window used for the forecasting experiments. Each search's output is validated internally on a preceding block of time (blue), and the resulting model is then used to make predictions for its corresponding forecasting period (orange). Training data includes all dates on or after 2020-08-08 and prior to the validation set. b, Time-series leaderboard showing weekly forecasting performance (Average WIS) for participating teams and our 'Google Retrospective' model, ordered by average WIS. Scores are aggregated across all 52 jurisdictions and four forecast horizons. The number within each cell is the model's absolute Average WIS for that week. The cell's background color visualizes the performance relative to the CovidHub-ensemble, with blue indicating a lower (better) WIS and red indicating a higher (worse) WIS. c, Direct jurisdiction-level comparison of forecasting error (Average WIS) between our model and the 'CovidHub-ensemble', demonstrating our model's superior performance in a majority of locations. d, Geographic distribution of our model's forecasting error (Average WIS), aggregated over the entire 2024/25 COVID-19 season. Lower error values (lighter colors) indicate better performance. e, Comparison of aggregate forecasting performance for various modeling strategies. This includes baseline models from the CovidHub competition, our retrospective model, our replications of submitted models, novel hybrid models generated through recombination, Deep Research [GeminiDeepResearch], and AI co-scientist [gottweis2025towards]. 14 strategies (10 recombination; two Deep Research; one AI co-scientist and one replicated baseline) outperform the official CovidHub-ensemble for the 3-week (3 reference dates × 4 time horizons × 52 jurisdictions) evaluation period. Models that perform worse than CovidHub-baseline are not shown.
  • ...and 21 more figures
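
The two-stage selection described in the experimental-design caption for single-cell batch integration (pick each replica's top node by training score, then pick among those finalists by validation score) can be sketched as follows. This is an illustrative sketch, not the authors' code: the `Candidate` class, the `select_per_method` function, the `validate` callable, and all scores in the toy data are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    """A node's script and its training-split score (hypothetical structure)."""
    script: str
    train_score: float


def select_per_method(
    trees: Dict[str, List[List[Candidate]]],
    validate: Callable[[str], float],
) -> Dict[str, str]:
    """Return one script per method.

    Stage 1: within each replica tree, keep the node with the best training score.
    Stage 2: among those finalists, keep the one with the best validation score.
    """
    chosen: Dict[str, str] = {}
    for method, replicas in trees.items():
        finalists = [max(tree, key=lambda c: c.train_score) for tree in replicas]
        chosen[method] = max(finalists, key=lambda c: validate(c.script)).script
    return chosen


# Toy data: one method, three replica trees, made-up scores.
toy_trees = {
    "bbknn": [
        [Candidate("a", 0.50), Candidate("b", 0.70)],
        [Candidate("c", 0.90)],
        [Candidate("d", 0.60), Candidate("e", 0.65)],
    ]
}
toy_validation = {"a": 0.99, "b": 0.60, "c": 0.40, "d": 0.10, "e": 0.62}

chosen = select_per_method(toy_trees, lambda script: toy_validation[script])
```

In the toy data, script "a" has the highest validation score but is never considered, because it is not the top training node of its tree. Only the final per-method choice would then be scored on held-out data, which keeps the holdout benchmark out of the selection loop.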