In-silico biological discovery with large perturbation models

Djordje Miladinovic, Tobias Höppe, Mathieu Chevalley, Andreas Georgiou, Lachlan Stuart, Arash Mehrjou, Marcus Bantscheff, Bernhard Schölkopf, Patrick Schwab

TL;DR

The paper presents the Large Perturbation Model (LPM), a decoder-only, PRC-disentangled framework that integrates heterogeneous perturbation experiments by learning separate embeddings for perturbations, readouts, and contexts. It demonstrates state-of-the-art performance in predicting unseen perturbation outcomes, identifying shared mechanisms across chemical and genetic perturbations, and enabling causal gene-gene network inference via imputed perturbations. The authors validate LPM’s utility through in silico PKD1 upregulation studies in autosomal dominant polycystic kidney disease and a retrospective clinical cohort, linking computational predictions to real-world outcomes while acknowledging limitations and the need for prospective validation. They also show that model performance scales with more data and contexts, supporting transfer learning across diverse perturbation screens and laying groundwork for accelerated, data-driven biological discovery. Overall, LPM offers a versatile, scalable approach to derive mechanistic insights and therapeutic hypotheses from pooled perturbation data, with potential to guide experiments and clinical decision-making.

Abstract

Data generated in perturbation experiments link perturbations to the changes they elicit and therefore contain information relevant to numerous biological discovery tasks -- from understanding the relationships between biological entities to developing therapeutics. However, these data encompass diverse perturbations and readouts, and the complex dependence of experimental outcomes on their biological context makes it challenging to integrate insights across experiments. Here, we present the Large Perturbation Model (LPM), a deep-learning model that integrates multiple, heterogeneous perturbation experiments by representing perturbation, readout, and context as disentangled dimensions. LPM outperforms existing methods across multiple biological discovery tasks, including in predicting post-perturbation transcriptomes of unseen experiments, identifying shared molecular mechanisms of action between chemical and genetic perturbations, and facilitating the inference of gene-gene interaction networks.
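To make the idea of "disentangled dimensions" concrete, the following is a minimal, untrained sketch of a predictor conditioned on perturbation, readout, and context symbols. It is not the authors' implementation: the class name `TinyLPM`, the embedding size, the concatenation of the three embeddings, the single hidden layer, and the context labels are all assumptions made for illustration, and training is omitted entirely.

```python
import math
import random


class TinyLPM:
    """Toy (P, R, C)-conditioned predictor: one embedding table per dimension,
    followed by a small MLP. Hypothetical sketch, weights are random."""

    def __init__(self, perturbations, readouts, contexts, dim=4, hidden=8, seed=0):
        rng = random.Random(seed)

        def table(symbols):
            # One independent vector per symbol (hypothetical initialisation).
            return {s: [rng.gauss(0.0, 0.1) for _ in range(dim)] for s in symbols}

        # Separate lookup tables keep the three dimensions disentangled.
        self.p_emb = table(perturbations)
        self.r_emb = table(readouts)
        self.c_emb = table(contexts)
        # MLP weights: (3 * dim) inputs -> hidden units -> scalar output.
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(3 * dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden)]
        self.b2 = 0.0

    def predict(self, p, r, c):
        # Concatenate the three symbol embeddings into one input vector.
        x = self.p_emb[p] + self.r_emb[r] + self.c_emb[c]
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        return sum(w * hi for w, hi in zip(self.w2, h)) + self.b2


# Illustrative symbols only; the context labels are placeholders.
P = ["CRISPRi-HMGCR", "simvastatin"]
R = ["PKD1", "HMGCR"]
C = ["context_A", "context_B"]

model = TinyLPM(P, R, C)
# Any (P, R, C) combination of known symbols can be queried, including
# combinations that would not appear together in a training set.
y = model.predict("simvastatin", "PKD1", "context_B")
```

Because each dimension has its own table, a symbol learned from one experiment can in principle be reused in queries involving other experiments, which is the property the abstract attributes to the disentangled representation.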

Paper Structure

This paper contains 28 sections, 2 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Large Perturbation Model (LPM) integrates data from multiple perturbation experiments to address a range of biological discovery tasks. Perturbation experiments originating from different studies are pooled together. Each experiment is placed in the space spanned by perturbations (P), readouts (R) and experimental contexts (C), where multiple experiments generally only partially overlap in the three-dimensional (P,R,C) space (top left). A large perturbation model (LPM; central icon) is trained on pooled perturbation data and can be queried with the symbolic representation of perturbation, readout, and context of experiments of interest to generate embeddings and predict outcomes even for configurations that were not observed during training. LPM embeddings and predictions carry rich information for a range of biological discovery tasks using transfer learning (bottom).
  • Figure 2: LPM outperforms existing methods in predicting post-perturbation gene expression. We compared the performance of LPM to state-of-the-art baselines across a variety of experimental settings, contexts, and perturbation types. a. A comparison of methods for post-perturbation expression prediction using z-normalized data including all readouts, comparing Pearson correlation (y-axis) on held-out test data from eight experimental contexts (x-axis) including single-cell (replogle2022mapping), bulk (LINCS subramanian2017next), genetic (CRISPRi and CRISPR-KO) and chemical compound interventions. In addition, we performed a comparison of methods for post-perturbation expression prediction that replicates the preprocessing methodology from roohani2023predicting and cui2023scgpt. In this comparison, we calculated Pearson correlation between true and predicted changes in log-normalized expression (control versus perturbed) measured on held-out test data b. for all genes and c. on the subset of the top 20 differentially expressed transcripts (y-axis). norman2019exploring includes both single and multi-perturbation data. Across all tested settings, perturbation types and contexts, LPM significantly outperforms state-of-the-art baselines. Bars indicate average performance across different seeds (dots on top of bars). Embedding ("emb" in parentheses) next to a baseline indicates that we used embeddings that were fine-tuned via a standardized Catboost strategy for evaluation (see \ref{sec:exp-setup} for details). For baselines without this indication, we followed the authors' instructions for generating the post-perturbation expression predictions. Not all methods are suitable for all settings that LPM operates on (e.g., GEARS roohani2023predicting requires single-cell resolved data), and such methods are therefore not included in all comparisons. Stars indicate statistical significance (one-sided Mann-Whitney, * = $p \le 0.05$).
  • Figure 3: LPM learns a joint space of compound and CRISPR perturbations. a. The latent space of compound and CRISPR knockouts (reduced to two dimensions via t-SNE) reflects known groupings of compound and genetic perturbations that target the same molecular mechanisms in bulk LINCS L1000 data from subramanian2017next. Genes targeted by corresponding CRISPR and compound inhibitors are color-coded in matching colors. b. Root mean squared error (RMSE) distances of known HMGCR inhibitors (statins) to the corresponding CRISPR-HMGCR perturbation in the embedding space of the LPM. Two outliers are highlighted in grey and additionally annotated in sub-figure a: benfluorex (withdrawn for cardiovascular side effects tribouilloy2011benfluorex) and pravastatin (shown to have low correlation to other statins jiang2021control and additional anti-inflammatory effects blake2000statins, mcgown2010beneficial, sommeijer2004anti). c. Using the distance between LPM perturbation embeddings for chemical and genetic perturbations achieves higher recall of known inhibitors of the respective genetic target than the embeddings derived from post-perturbation L1000 transcriptome profiles.
  • Figure 4: LPM embeddings reflect rich biological relationships. a. LPM perturbation (P) embeddings (t-SNE embedded in 2D space). Each point represents a CRISPRi perturbation color-coded by molecular function of its respective genetic target from replogle2022mapping. b. LPM perturbation (P) embeddings significantly outperform existing state-of-the-art gene embeddings derived from large-scale genetic screens and public pathway and interaction databases in predicting gene function annotations from replogle2022mapping ($p \leq 0.01$; one-sided Mann-Whitney-Wilcoxon test, 5 random seeds). c. LPM context (C) embeddings (2D t-SNE representation) quantify similarity between experimental contexts. Intriguingly, we found that contexts are grouped with respect to the model system under study (shown in the figure) or by type of perturbation (not shown), depending on the t-SNE random seed used.
  • Figure 5: In-silico discovery of potential therapeutics for autosomal dominant polycystic kidney disease (ADPKD). a. Using LPM, we conducted an in-silico perturbation study in which we identified clinical-stage drugs that are predicted to upregulate PKD1 in embryonic kidney cells. A lack of functional copies of PKD1 is hypothesised to be causally involved in ADPKD pathogenesis and progression (hopp2012functional, rossetti2009incompletely, gainullin2015polycystin). We found that triptolide, simvastatin (bold), and other statins are the top predicted upregulators of PKD1 among clinical-stage drugs. For reference, we also include the predicted CRISPRi on vasopressin receptor 2 (AVPR2) to simulate the effect of the FDA-approved AVPR2 antagonist tolvaptan (torres2012tolvaptan), which is mechanistically distinct (wang2008vasopressin). b. Since simvastatin is commonly prescribed for cardiovascular indications, we were able to conduct a retrospective cohort study in large-scale electronic health records to further substantiate the potential efficacy of simvastatin in reducing ADPKD progression in the clinic. Most notably, we found that, among individuals diagnosed with ADPKD, exposure to simvastatin for 1 year or longer (blue) is associated with a significant ($p \leq 0.05$; 5-year relative risk [RR] $= 0.86$ and 10-year RR $= 0.74$) reduction in progression to end-stage renal disease (ESRD) compared to those not exposed to statins (red).
  • ...and 5 more figures
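
The embedding-distance retrieval described in the Figure 3 caption (ranking compounds by their RMSE distance to a CRISPR perturbation embedding and measuring recall of known inhibitors) can be sketched as follows. This is a hedged illustration only: the helper names `rmse` and `recall_at_k`, the choice of recall at a fixed cutoff `k`, the compound labels, and the two-dimensional vectors are all made up for the example and are not values or procedures taken from the paper.

```python
import math


def rmse(a, b):
    """Root mean squared error between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def recall_at_k(target_emb, compound_embs, known_inhibitors, k):
    """Fraction of known inhibitors found among the k compounds whose
    embeddings lie closest (by RMSE) to the genetic perturbation embedding."""
    ranked = sorted(compound_embs, key=lambda name: rmse(compound_embs[name], target_emb))
    hits = sum(1 for name in ranked[:k] if name in known_inhibitors)
    return hits / len(known_inhibitors)


# Toy, made-up embeddings (not model outputs).
crispr_target = [1.0, 0.0]
compounds = {
    "statin_1": [0.9, 0.1],
    "statin_2": [0.8, -0.1],
    "other_1": [-1.0, 0.5],
    "other_2": [0.0, 1.0],
}
known = {"statin_1", "statin_2"}

recall_top1 = recall_at_k(crispr_target, compounds, known, k=1)  # 0.5
recall_top2 = recall_at_k(crispr_target, compounds, known, k=2)  # 1.0
```

The same function could be applied to embeddings derived from raw post-perturbation profiles to compare the two representations, which is the kind of comparison panel c reports.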