Efficient and Principled Scientific Discovery through Bayesian Optimization: A Tutorial

Zhongwei Yu, Rasul Tutunov, Alexandre Max Maraval, Zikai Xie, Zhenzhi Tan, Jiankang Wang, Zijing Li, Liangliang Xu, Qi Yang, Jun Jiang, Sanzhong Luo, Zhenxiao Guo, Haitham Bou-Ammar, Jun Wang

Abstract

Traditional scientific discovery relies on an iterative hypothesise-experiment-refine cycle that has driven progress for centuries, but its intuitive, ad-hoc implementation often wastes resources, yields inefficient designs, and misses critical insights. This tutorial presents Bayesian Optimisation (BO), a principled probability-driven framework that formalises and automates this core scientific cycle. BO uses surrogate models (e.g., Gaussian processes) to model empirical observations as evolving hypotheses, and acquisition functions to guide experiment selection, balancing exploitation of known knowledge and exploration of uncharted domains to eliminate guesswork and manual trial-and-error. We first frame scientific discovery as an optimisation problem, then unpack BO's core components, end-to-end workflows, and real-world efficacy via case studies in catalysis, materials science, organic synthesis, and molecule discovery. We also cover critical technical extensions for scientific applications, including batched experimentation, heteroscedasticity, contextual optimisation, and human-in-the-loop integration. Tailored for a broad audience, this tutorial bridges AI advances in BO with practical natural science applications, offering tiered content to empower cross-disciplinary researchers to design more efficient experiments and accelerate principled scientific discovery.
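To ground the loop the abstract describes, the following is a minimal, self-contained sketch of Bayesian optimisation in Python, using scikit-learn's Gaussian process as the surrogate and expected improvement as the acquisition function. It is illustrative only: the objective `run_experiment`, the search range, and all numerical settings are assumptions, not taken from the paper.

```python
# Minimal Bayesian optimisation loop: GP surrogate + expected improvement (EI).
# `run_experiment` is a hypothetical stand-in for a real laboratory measurement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def run_experiment(x):
    """Hypothetical noisy objective (e.g., yield as a function of a setting)."""
    return float(-np.sin(3 * x) - x**2 + 0.7 * x + np.random.normal(0, 0.05))

def expected_improvement(X, gp, y_best, xi=0.01):
    """EI acquisition: trades off exploitation (high mean) vs. exploration (high std)."""
    mu, sigma = gp.predict(X, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Initial design: a few random experiments over the search space [-1, 2].
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 2.0, size=(3, 1))
y = np.array([run_experiment(x[0]) for x in X])

for _ in range(10):  # each iteration = one hypothesise-experiment-refine cycle
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)                                  # update the surrogate (hypothesis)
    grid = np.linspace(-1.0, 2.0, 500).reshape(-1, 1)
    x_next = grid[np.argmax(expected_improvement(grid, gp, y.max()))]
    y_next = run_experiment(x_next[0])            # run the selected experiment
    X, y = np.vstack([X, [x_next]]), np.append(y, y_next)

print(f"Best observed: x = {X[np.argmax(y)][0]:.3f}, f(x) = {y.max():.3f}")
```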

Paper Structure

This paper contains 155 sections, 4 theorems, 80 equations, 10 figures, 1 table, and 3 algorithms.

Key Result

Theorem 3.6

Suppose a sequential optimisation algorithm produces a sequence $(x_t)_{t=1}^T$ such that $\lim_{T \to \infty} f(x_T) = f(x^*)$. Then, the average regret vanishes asymptotically: $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \big( f(x^*) - f(x_t) \big) = 0$.
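The statement is an instance of the Cesàro-mean lemma. A brief proof sketch in LaTeX, assuming the regret of Definition 3.5 is the standard $r_t = f(x^*) - f(x_t)$ for maximisation (the paper's exact convention may differ):

```latex
% Proof sketch of Theorem 3.6 (Cesaro-mean argument).
% Assumption: Definition 3.5 sets r_t = f(x^*) - f(x_t) (maximisation).
\begin{proof}[Sketch]
Let $r_t = f(x^*) - f(x_t)$ be the instantaneous regret at step $t$.
Convergence $\lim_{t \to \infty} f(x_t) = f(x^*)$ gives $r_t \to 0$.
By the Ces\`aro-mean lemma, the running averages of a convergent
sequence converge to the same limit, hence
\[
  \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_t
    = \lim_{t \to \infty} r_t = 0 .
\]
\end{proof}
```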

Figures (10)

  • Figure 1: A four-panel comic illustrates the conceptual journey of translating human scientific intuition into a formalised, autonomous discovery framework for identifying a high-performance "ever-light" catalyst. All characters, dialogues, and events depicted in this narrative are entirely fictional and are used solely for illustrative purposes. The main characters are Professor Aris (senior researcher), Leo (PhD student), and Sarah (research engineer). The four scenes: (1) The Wall of Infinite Choices highlights the complexity of the chemical search space and the tension between logical certainty and probabilistic exploration; (2) The High Stakes of "Just One More" exposes the "exploitation trap" of settling for local optima under resource constraints versus unguided exploration; (3) The "Ghost" in the Machine poses the question of translating uncertainty into actionable instructions for autonomous experimentation; (4) The Vision of the Autonomous Lab resolves this tension via Bayesian optimization, where surrogate models quantify uncertainty, acquisition functions balance exploration/exploitation, and autonomous labs navigate the search space efficiently.
  • Figure 2: The illustration of core paradigms of scientific discovery with historical representatives. (i) Demonstrative discovery derives certain knowledge from self-evident first principles through formal logical deduction, independent of empirical observation. (ii) Inductive discovery constructs general laws by systematically accumulating empirical data via controlled experiments, inducing patterns from an evidence base. (iii) Hypothetico-deductive discovery proposes tentative hypotheses, derives testable predictions, validates them against empirical data, and refines or rejects the hypotheses through consistency checks. (iv) Bayesian hypothetico-deductive discovery is an emergent modern methodology that quantifies degrees of belief across a hypothesis space and iteratively updates these beliefs via Bayesian inference; the resulting posterior provides both a summary of robust knowledge and guidance for decision-making.
  • Figure 3: The AntBO workflow for automated antibody design (Figure 1 of khan2022antborealworldautomatedantibody). (A) Constraints on the search space: Examples of biologically unsatisfiable CDRH3 sequences (e.g., excessive repetition or invalid charge) that the optimiser must learn to avoid. (B) The Bayesian Optimisation loop: (Step 1) Starting from an initial dataset of sequences and their energy/affinity scores; (Step 2) Fitting a Gaussian Process ($\text{GP}_\theta$) to model the objective and its uncertainty; (Step 3) Maximising an acquisition function ($\alpha_\theta$) within a local trust region to propose a candidate $x^*$ that balances performance and biological plausibility; (Step 4) Evaluating the candidate via a black-box oracle (e.g., the Absolut! simulator) to generate new data and update the model.
  • Figure 4: Scientific discovery as sequential model-based optimisation: iteratively updating a model of the unknown mechanism based on experimental data, and using this model to guide the selection of future experiments.
  • Figure 5: Illustration of surrogate modelling in black-box optimisation. The red curve shows the true objective function, and red dots indicate observed data points, slightly perturbed from the true objective values by observation noise. The blue curve represents the surrogate model's predicted mean, while the shaded region denotes the 95% confidence interval, quantifying the model's uncertainty about the objective. (A minimal code sketch of such a surrogate follows this figure list.)
  • ...and 5 more figures
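As referenced in the Figure 5 entry above, the following minimal Python sketch reproduces the ingredients of that figure: a GP posterior mean and 95% confidence interval fitted to noisy observations. The synthetic objective `f_true`, the noise level, and the kernel choice are illustrative assumptions, not the paper's code.

```python
# Sketch of Figure 5's surrogate: GP posterior mean and 95% confidence interval
# fitted to noisy observations of an illustrative objective (f_true is assumed).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

f_true = lambda x: np.sin(2 * np.pi * x)               # "red curve": true objective
rng = np.random.default_rng(1)
X_obs = rng.uniform(0, 1, size=(8, 1))                 # observed inputs
y_obs = f_true(X_obs).ravel() + rng.normal(0, 0.1, 8)  # "red dots": noisy data

# WhiteKernel lets the GP estimate the observation noise level from the data.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_obs, y_obs)

X_grid = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_grid, return_std=True)        # "blue curve": posterior mean
lower, upper = mu - 1.96 * sigma, mu + 1.96 * sigma    # shaded 95% interval

print(f"max predictive std: {sigma.max():.3f}")  # uncertainty grows far from data
```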

Theorems & Definitions (27)

  • Definition 3.1: Optimisation Problem
  • Definition 3.2: Global Optimum
  • Definition 3.3: Local Optimum
  • Definition 3.4: Convergence
  • Definition 3.5: Regret
  • Theorem 3.6: Convergence Implies Vanishing Average Regret
  • Proof of Theorem 3.6
  • Theorem 3.7: Sublinear Regret Implies Convergence in Average
  • Proof of Theorem 3.7
  • Example 4.1
  • ...and 17 more