TxPert: Leveraging Biochemical Relationships for Out-of-Distribution Transcriptomic Perturbation Prediction

Frederik Wenkel, Wilson Tu, Cassandra Masschelein, Hamed Shirzad, Cian Eastwood, Shawn T. Whitfield, Ihab Bendidi, Craig Russell, Liam Hodgson, Yassir El Mesbahi, Jiarui Ding, Marta M. Fay, Berton Earnshaw, Emmanuel Noutahi, Alisandra K. Denton

TL;DR

TxPert introduces a unified, knowledge-graph–guided framework for predicting transcriptomic perturbation effects under out-of-distribution conditions. It combines a basal state encoder with a perturbation encoder that leverages multiple gene–gene interaction networks, fusing information through latent transfer and decoding to forecast perturbation-induced expression across unseen single and combinatorial perturbations and across novel cell lines. The work presents rigorous metric design, extensive ablations, and a multi-graph benchmarking approach that demonstrates state-of-the-art performance and robust generalization, addressing prior concerns about foundation models in perturbation biology. The framework provides a practical path toward scalable in silico perturbation prediction, with potential to accelerate drug discovery, cross-context extrapolation, and personalized medicine, and it outlines future directions in few-shot/active learning and expanded evaluation protocols.

Abstract

Accurately predicting cellular responses to genetic perturbations is essential for understanding disease mechanisms and designing effective therapies. Yet exhaustively exploring the space of possible perturbations (e.g., multi-gene perturbations or across tissues and cell types) is prohibitively expensive, motivating methods that can generalize to unseen conditions. In this work, we explore how knowledge graphs of gene-gene relationships can improve out-of-distribution (OOD) prediction across three challenging settings: unseen single perturbations; unseen double perturbations; and unseen cell lines. In particular, we present: (i) TxPert, a new state-of-the-art method that leverages multiple biological knowledge networks to predict transcriptional responses under OOD scenarios; (ii) an in-depth analysis demonstrating the impact of graphs, model architecture, and data on performance; and (iii) an expanded benchmarking framework that strengthens evaluation standards for perturbation modeling.

Paper Structure

This paper contains 31 sections, 22 equations, 14 figures, 1 algorithm.

Figures (14)

  • Figure 1: A) Pearson correlation of aggregated control gene expression profiles within and across experimental batches. B) Correlation between single perturbations and the mean baseline, i.e., the mean delta calculated over aggregates of essential (as defined by Replogle et al.), non-essential, or all genes ($\#\text{samples} \in \{2058, 7815, 9866\}$, respectively). C) Correlation between the mean baseline aggregated within or between studies and cell types. All data are CRISPRi unless marked with (oe) for overexpression data from Norman et al. (2019). D) Normalized retrieval between true perturbant replicates in different biological contexts. Retrieval is calculated based on the indicated expression representations and metrics ($\#\text{samples} = 18$). The plotted value is the 0.9 quantile (across all unique perturbants), where expected random performance is 0.9, indicated by the dashed line.
  • Figure 2: A) The TxPert architecture predicts post-perturbation gene expression by combining two modules: (1) a basal state encoder that maps batch-matched control profiles into a latent embedding, and (2) a Graph Neural Network (GNN) that learns perturbation embeddings from a gene-gene interaction graph. Perturbation embeddings are applied to the basal embedding, and the resulting latent representation is decoded to produce the predicted gene expression profile. B) OOD Perturbation effect prediction tasks for: (i) unseen single perturbations within the training cell line, (ii) novel double perturbations, where constituent singles may have been seen during training, within the training cell line, and (iii) perturbations within new cell lines not seen during training.
  • Figure 3: A) Performance of TxPert compared to GEARS and scLAMBDA on predicting unseen single perturbations within a known cell type. Horizontal bars indicate general baseline, a batch-informed model (capturing potential confounding), and experimental reproducibility (see Section \ref{sec:method}). B) Comparison of TxPert, GEARS and scLAMBDA in predicting double perturbation effects from known singles. C) Comparison of TxPert and scLAMBDA performance on predicting single perturbations in unseen cell lines.
  • Figure 4: Ablation studies for unseen perturbation effect prediction on K562. A) Performance of TxPert as edges of the STRINGdb graph are progressively rewired. B) Performance of TxPert (Exphormer) using individual graphs. C) Comparison of various graph integration strategies and architectures. D) Performance of TxPert (Exphormer) as multiple knowledge graphs (STRINGdb, GO, PxMap, TxMap) are successively integrated into the Exphormer architecture (Exphormer-MG). Horizontal bars indicate general baseline performance, the performance of a learned model making predictions on the basis of batch information (in case of confounding between batch and perturbation), and an experimental reproducibility estimate.
  • Figure 5: Investigation into strengths and weaknesses of our models. A) Breakdown of Pearson $\Delta$ by the knowledge level (Pharos rank) of the assayed genes. B) Spearman correlation between performance (Pearson $\Delta$) and per-perturbation metadata hypothesized to be related to performance: data-intrinsic factors (number of differentially expressed genes, sum of absolute $\Delta$s) and biological knowledge factors (degree of the perturbed node in the graph, Pharos knowledge level). C) Signed error in predicting the expression of perturbation-target genes, both when they are and when they are not the target of the perturbation. D) Example prediction vs. ground truth for all genes when DYNC1H1 is perturbed, with the target, DYNC1H1, shown in red. DYNC1H1 was chosen as an arbitrary but representative example of the common failure to predict the true downregulation of the perturbation target.
  • ...and 9 more figures
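
Several captions above refer to Pearson $\Delta$ and to a mean baseline. The following is a minimal sketch of how the two could be computed, under two assumptions: that Pearson $\Delta$ is the Pearson correlation between predicted and observed expression changes relative to a control profile, and that the mean baseline predicts the average training delta for every held-out perturbation. The function names and toy values are hypothetical, and the paper's exact aggregation and gene selection may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of floats."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def delta(expr, control):
    """Per-gene expression change of a profile relative to the control profile."""
    return [e - c for e, c in zip(expr, control)]


def mean_baseline(train_deltas):
    """Assumed mean baseline: the per-gene average of all training deltas."""
    n = len(train_deltas)
    return [sum(col) / n for col in zip(*train_deltas)]


def pearson_delta(pred_expr, true_expr, control):
    """Assumed Pearson delta: correlation of predicted vs. observed deltas."""
    return pearson(delta(pred_expr, control), delta(true_expr, control))


# Toy data (hypothetical): 4 genes, 2 training perturbations, 1 held-out one.
control = [1.0, 2.0, 3.0, 4.0]
train_perturbed = [
    [0.5, 2.5, 3.0, 4.5],
    [0.8, 2.2, 2.5, 4.0],
]
held_out = [0.6, 2.4, 2.8, 4.3]

# The baseline ignores which gene was perturbed and always predicts
# control + mean training delta.
avg_delta = mean_baseline([delta(p, control) for p in train_perturbed])
baseline_pred = [c + d for c, d in zip(control, avg_delta)]

baseline_score = pearson_delta(baseline_pred, held_out, control)
print(f"Pearson delta of the mean baseline: {baseline_score:.3f}")
```

In this toy example the held-out perturbation resembles the training ones, so the baseline scores highly without any perturbation-specific information. That is why a model's Pearson $\Delta$ is informative mainly when compared against such a baseline.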