
BioBO: Biology-informed Bayesian Optimization for Perturbation Design

Yanke Li, Tianyu Cui, Tommaso Mansi, Mangal Prakash, Rui Liao

Abstract

Efficient design of genomic perturbation experiments is crucial for accelerating drug discovery and therapeutic target identification, yet exhaustive perturbation of the human genome remains infeasible due to the vast search space of potential genetic interactions and experimental constraints. Bayesian optimization (BO) has emerged as a powerful framework for selecting informative interventions, but existing approaches often fail to exploit domain-specific biological prior knowledge. We propose Biology-Informed Bayesian Optimization (BioBO), a method that integrates Bayesian optimization with multimodal gene embeddings and enrichment analysis, a widely used tool for gene prioritization in biology, to enhance surrogate modeling and acquisition strategies. BioBO combines biologically grounded priors with acquisition functions in a principled framework, which biases the search toward promising genes while maintaining the ability to explore uncertain regions. Through experiments on established public benchmarks and datasets, we demonstrate that BioBO improves labeling efficiency by 25-40%, and consistently outperforms conventional BO by identifying top-performing perturbations more effectively. Moreover, by incorporating enrichment analysis, BioBO yields pathway-level explanations for selected perturbations, offering mechanistic interpretability that links designs to biologically coherent regulatory circuits.

Paper Structure

This paper contains 48 sections, 10 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: BioBO pipeline for perturbation design. We make two methodological innovations: (i) fusion of gene modalities to improve surrogate modeling; (ii) enrichment analysis on top of surrogate model predictions to strengthen gene acquisition by incorporating biological information.
  • Figure 2: Performance across single modalities (Achilles, Gene2Vec, GenePT) and their Fusion on IFN-$\gamma$ (top) and IL-2 (bottom). Row-wise dashed lines indicate the Fusion value at the final cycle (20) for UCB, EI, TS, and DiscoBAX to aid comparison. We observe that BO with Fusion is better than BO with any single modality.
  • Figure 3: Relationship between BO performance and surrogate model quality. We observe that Fusion (red) does not improve the surrogate model globally (LL global, first column). However, it improves the fit on data points near the optimum (LL@top-1% to LL@top-10%), which explains the improvement in BO results (top-k recall). Specifically, the top-k recall of BO correlates more strongly with local LL than with global LL, as measured by both Spearman and Pearson correlation.
  • Figure 4: Performance of pure EA and BioUCB on Achilles. (a): Pure EA on IFN-$\gamma$ and IL-2. We observe that pure EA provides better designs than random. (b): BioUCB on Achilles for IFN-$\gamma$ and IL-2. We observe that BioUCB provides better designs than UCB and pure EA.
  • Figure 5: CCLE and STRING modalities across datasets. Panels (left→right): IFN-$\gamma$—CCLE, IFN-$\gamma$—STRING, IL-2—CCLE, IL-2—STRING. Curves show base acquisitions UCB/EI/TS (solid), biology-informed variants BioUCB/BioEI/BioTS with GO (dotted) and HM (dash–dot) in the same family color, plus Random (gray). Shaded ribbons denote mean $\pm$ s.e.m.
  • ...and 7 more figures
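
The captions above refer to UCB acquisition and to top-k recall without defining them. As a rough illustration only, the sketch below shows the standard UCB score (predicted mean plus a multiple of the predictive standard deviation) and one plausible reading of top-k recall (the fraction of the truly best genes that have been selected so far). The function names, the `beta` value, and the recall definition are assumptions made for illustration; this is not the paper's code and it does not reproduce BioUCB or the enrichment-analysis step.

```python
import math


def ucb_scores(means, stds, beta=2.0):
    """Standard UCB: predicted mean plus beta times predictive std.

    `beta` is an illustrative exploration weight, not a value from the paper.
    """
    return [m + beta * s for m, s in zip(means, stds)]


def select_batch(scores, already_selected, batch_size):
    """Pick the highest-scoring candidate indices not yet selected."""
    candidates = [i for i in range(len(scores)) if i not in already_selected]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:batch_size]


def top_k_recall(true_values, selected, top_fraction=0.1):
    """Assumed definition: share of the true top genes found in `selected`."""
    n_top = max(1, math.ceil(top_fraction * len(true_values)))
    ranked = sorted(range(len(true_values)),
                    key=lambda i: true_values[i], reverse=True)
    top = set(ranked[:n_top])
    return len(top & set(selected)) / n_top


# Toy example with four hypothetical genes.
means = [0.1, 0.5, 0.3, 0.9]          # surrogate predictive means
stds = [0.5, 0.1, 0.1, 0.05]          # surrogate predictive stds
true_values = [0.2, 0.4, 0.1, 1.0]    # ground-truth phenotype effects

scores = ucb_scores(means, stds)
batch = select_batch(scores, already_selected=set(), batch_size=2)
recall = top_k_recall(true_values, batch, top_fraction=0.5)
```

In this toy case the uncertain gene 0 is ranked first despite its low predicted mean, which illustrates how UCB trades off exploration against exploitation. A biology-informed variant would additionally adjust these scores using prior information, but the exact form of that adjustment is not stated in this section.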