
Causal Cellular Context Transfer Learning (C3TL): An Efficient Architecture for Prediction of Unseen Perturbation Effects

Michael Scholkemper, Sach Mukherjee

Abstract

Predicting the effects of chemical and genetic perturbations on quantitative cell states is a central challenge in computational biology, molecular medicine and drug discovery. Recent work has leveraged large-scale single-cell data and massive foundation models to address this task. However, such computational resources and extensive datasets are not always accessible in academic or clinical settings, which limits their utility. Here we propose a lightweight framework for perturbation effect prediction that exploits the structured nature of biological interventions and specific inductive biases and invariances. Our approach leverages available information concerning perturbation effects to allow generalization to novel contexts and requires only widely available bulk molecular data. Extensive testing, comparing predictions of context-specific perturbation effects against real, large-scale interventional experiments, demonstrates accurate prediction in new contexts. The proposed approach is competitive with state-of-the-art (SOTA) foundation models but requires simpler data, much smaller model sizes and less time. Focusing on robust bulk signals and efficient architectures, we show that accurate prediction of perturbation effects is possible without proprietary hardware or very large models, hence opening up ways to leverage causal learning approaches in biomedicine generally.

Paper Structure

This paper contains 12 sections, 1 theorem, 20 equations, 3 figures, and 2 tables.

Key Result

Proposition 3.4

$Y = \mathcal{T}(c,p)$ is a global minimum of the $\ell_2$ loss

Figures (3)

  • Figure 1: Proposed architecture. The model compresses the high-dimensional gene expression input $\{x^\gamma_p\}_\gamma$ from multiple contexts (indexed by $\gamma$) but under the same perturbation $p$ into a lower-dimensional latent representation $\hat{z}_p$ of the perturbation itself. Similarly, it forms a representation $\hat{\psi}_c$ of the context from all the different perturbations (indexed by $\pi$) observed in the specific context $c$. From these latent representations, it then reconstructs the original gene expression $\hat{x}^c_p$ using a decoder $\Gamma$.
  • Figure 2: Scatter plot of model predictions. Comparison of model outputs versus true target values for the Tahoe-100 dataset. The dashed line represents perfect prediction.
  • Figure 3: Sensitivity to data scarcity. Performance evaluation under varying training data availability. Panels (left to right) display regimes with 43, 20, 10, and 5 randomly sampled training contexts. The x-axis represents the number of interventions in the test context available for adaptation (the remaining interventions were used for testing). The ticks correspond (from left to right) to $80\%$, $50\%$, $30\%$, $20\%$, $10\%$, $5\%$, $1\%$ of the total data available in the test contexts. The y-axis displays the Pearson correlation between prediction and target ($\pm$ standard deviation across 5 test contexts). In all experiments, metrics are computed only on interventions that the models have not seen in the target context.
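
The evaluation protocol described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' code. It assumes that each test context's interventions are split at random into an adaptation set and a held-out set, that the Pearson correlation is pooled over all held-out interventions of a context, and that results are summarized as mean and standard deviation across contexts. The names `split_interventions`, `evaluate`, and `noisy_predictor`, along with the toy data, are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_interventions(interventions, adapt_fraction, seed=0):
    """Randomly split interventions into (adaptation, held-out) lists."""
    shuffled = list(interventions)
    random.Random(seed).shuffle(shuffled)
    # Assumption: at least one intervention is always kept for adaptation.
    n_adapt = max(1, round(adapt_fraction * len(shuffled)))
    return shuffled[:n_adapt], shuffled[n_adapt:]


def evaluate(contexts, predict, adapt_fraction):
    """Return (mean, std) of the per-context Pearson correlation.

    contexts: dict mapping context -> {intervention -> list of target values}
    predict:  hypothetical callable (context, adapt_set, intervention) -> list
    Each context needs at least two interventions so the held-out set is
    non-empty, and at least two contexts are needed for the std.
    """
    scores = []
    for c, targets in contexts.items():
        adapt, held_out = split_interventions(sorted(targets), adapt_fraction)
        adapt_set = {p: targets[p] for p in adapt}
        pred, true = [], []
        for p in held_out:  # score only interventions unseen in context c
            pred.extend(predict(c, adapt_set, p))
            true.extend(targets[p])
        scores.append(pearson(pred, true))
    return mean(scores), stdev(scores)


# Toy data (fabricated for illustration only): 5 test contexts,
# 10 interventions each, 4 measured values per intervention.
_rng = random.Random(1)
toy_contexts = {
    f"ctx{i}": {f"p{j}": [_rng.gauss(0, 1) for _ in range(4)] for j in range(10)}
    for i in range(5)
}


def noisy_predictor(context, adapt_set, intervention):
    """Stand-in for a trained model: the true target plus Gaussian noise."""
    noise = random.Random(f"{context}/{intervention}")
    return [t + noise.gauss(0, 0.5) for t in toy_contexts[context][intervention]]


mean_r, std_r = evaluate(toy_contexts, noisy_predictor, adapt_fraction=0.2)
print(f"Pearson r = {mean_r:.3f} +/- {std_r:.3f}")
```

Calling `evaluate` once per adaptation fraction (0.8, 0.5, 0.3, 0.2, 0.1, 0.05, 0.01) would produce one curve of the kind shown in a single panel. Whether the paper pools correlations per context or averages them per intervention is not stated in the caption, so the pooling here is an assumption.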

Theorems & Definitions (2)

  • Remark 3.3
  • Proposition 3.4