Active learning for efficient discovery of optimal gene combinations in the combinatorial perturbation space

Jason Qin, Hans-Hermann Wessels, Carlos Fernandez-Granda, Yuhan Hao

TL;DR

The paper tackles the challenge of discovering optimal two-gene perturbations in a combinatorial space too large for exhaustive CRISPR screening. It proposes NAIAD, a data-efficient active-learning framework that combines an over-parameterized single-gene effects encoder with $p$-dimensional adaptive gene embeddings to model additive and nonlinear gene interactions, guided by ensemble-based uncertainty estimates and a Maximum Predicted Effects (MPE) acquisition strategy. The authors show that NAIAD achieves up to a 40% RMSE improvement over strong baselines in small-sample settings across four bulk CRISPR perturbation datasets, and that MPE sampling uncovers top perturbations more effectively than other acquisition strategies. The work supports more efficient CRISPR library design and genomics-driven therapeutic discovery by reducing the number of experiments needed to explore the combinatorial perturbation space.

Abstract

The advancement of novel combinatorial CRISPR screening technologies enables the identification of synergistic gene combinations on a large scale. This is crucial for developing novel and effective combination therapies, but the combinatorial space makes exhaustive experimentation infeasible. We introduce NAIAD, an active learning framework that efficiently discovers optimal gene pairs capable of driving cells toward desired cellular phenotypes. NAIAD leverages single-gene perturbation effects and adaptive gene embeddings that scale with the training data size, mitigating overfitting in small-sample learning while capturing complex gene interactions as more data is collected. Evaluated on four CRISPR combinatorial perturbation datasets totaling over 350,000 genetic interactions, NAIAD, trained on small datasets, outperforms existing models by up to 40% relative to the second-best. NAIAD's recommendation system prioritizes gene pairs with the maximum predicted effects, resulting in the highest marginal gain in each AI-experiment round and accelerating discovery with fewer CRISPR experimental iterations. Our NAIAD framework (https://github.com/NeptuneBio/NAIAD) improves the identification of novel, effective gene combinations, enabling more efficient CRISPR library design and offering promising applications in genomics research and therapeutic development.
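
The recommendation step described in the abstract, choosing the gene pairs with the maximum predicted effects for the next experimental round, might be sketched as follows. This is a minimal illustration and not the code in the NAIAD repository: the function name `select_max_predicted_effect`, the array layout, and the choice to rank by the ensemble mean are all assumptions made for the example.

```python
import numpy as np


def select_max_predicted_effect(ensemble_preds, candidate_pairs, measured, batch_size):
    """Return up to `batch_size` unmeasured pairs with the largest mean prediction.

    ensemble_preds  : array of shape (n_models, n_candidates), one row per
                      ensemble member (hypothetical layout)
    candidate_pairs : list of (gene_a, gene_b) tuples, length n_candidates
    measured        : set of pairs that have already been screened
    """
    # Assumption: the ensemble is summarized by its mean prediction.
    mean_pred = ensemble_preds.mean(axis=0)
    # Sort candidates from largest to smallest predicted effect.
    order = np.argsort(-mean_pred, kind="stable")
    chosen = []
    for idx in order:
        pair = candidate_pairs[idx]
        if pair in measured:
            continue  # skip pairs that were screened in earlier rounds
        chosen.append(pair)
        if len(chosen) == batch_size:
            break
    return chosen


# Toy example: two ensemble members scoring four candidate pairs.
pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
preds = np.array([[0.1, 0.9, 0.5, 0.7],
                  [0.3, 0.7, 0.5, 0.9]])
next_batch = select_max_predicted_effect(preds, pairs, {("A", "C")}, batch_size=2)
```

In an active-learning loop, the selected batch would be screened experimentally, added to the training set, and the ensemble retrained before the next call.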

Paper Structure

This paper contains 23 sections, 9 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of active learning framework in CRISPR combinatorial perturbation (A) and our NAIAD model architecture with overparameterized single-gene effects and adaptive gene embedding modules (B).
  • Figure 2: Performance on test data of different gene embedding settings in NAIAD using the Norman dataset (4429 training combinations) across varying training data sizes, reported as $\log(\text{Mean Square Error})$. The models without gene embeddings or with low-dimensional embeddings perform well with small training data but do not improve as more data are added. In contrast, the MLP model with larger gene embeddings outperforms these models when the training data exceeds 30%. The adaptive embedding approach achieves the best performance across all training data sizes.
  • Figure 3: Benchmark analysis comparing the NAIAD model with GEARS and RECOVER models, evaluated using test data $\log(\text{MSE})$ across different numbers of gene combinations in training data. Error bars are SE across three cross-fold replicates.
  • Figure 4: Comparison of different acquisition functions, evaluated by prediction accuracy for the top $N$ perturbations across four iteration rounds (see Appendix D for a full description of the accuracy metric). Error bars are SE from three cross-fold replicates.
  • Figure 5: Evaluation of compressed single-gene effects and gene embeddings shows a strong correlation between the compressed single-gene effects and the values predicted by the linear model. As the training data increases, the correlation between gene embeddings and the residuals of the linear model predictions gradually becomes stronger.
  • ...and 3 more figures