PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis

Yan Wu; Esther Wershof; Sebastian M Schmon; Marcel Nassar; Błażej Osiński; Ridvan Eksi; Zichao Yan; Rory Stark; Kun Zhang; Thore Graepel

TL;DR

PerturBench tackles the need for standardized benchmarking in cellular perturbation analysis by delivering a modular framework, curated datasets, and a comprehensive metric suite that includes rank-based evaluations. The study shows that no single architecture universally outperforms others across all settings, with simple baselines often scaling better as data size grows, while autoencoder-based models excel in distributional metrics; rank metrics prove essential to detect mode collapse that RMSE alone misses. Through thorough ablations and cross-dataset experiments, the work identifies key components that influence performance and demonstrates that robust benchmarking is feasible and informative for guiding model development. Collectively, PerturBench provides a valuable resource to accelerate robust in-silico perturbation screening for therapeutic discovery.

Abstract

We introduce a comprehensive framework for modeling single cell transcriptomic responses to perturbations, aimed at standardizing benchmarking in this rapidly evolving field. Our approach includes a modular and user-friendly model development and evaluation platform, a collection of diverse perturbational datasets, and a set of metrics designed to fairly compare models and dissect their performance. Through extensive evaluation of both published and baseline models across diverse datasets, we highlight the limitations of widely used models, such as mode collapse. We also demonstrate the importance of rank metrics which complement traditional model fit measures, such as RMSE, for validating model effectiveness. Notably, our results show that while no single model architecture clearly outperforms others, simpler architectures are generally competitive and scale well with larger datasets. Overall, this benchmarking exercise sets new standards for model evaluation, supports robust model development, and furthers the use of these models to simulate genetic and chemical screens for therapeutic discovery.

Paper Structure (77 sections, 12 equations, 6 figures, 22 tables)

Introduction
Related Works
Contributions
Datasets and Tasks
Perturbation Prediction Models
Modeling counterfactuals
Matched Controls
Disentanglement
Models for benchmarking
Baseline models
Linear
Latent Additive
Decoder Only
Benchmarking
Population Aggregation
...and 62 more sections

Figures (6)

Figure 1: A) Single cell perturbational datasets at multiple scales. B) Biologically relevant covariate transfer and combinatorial prediction data splits. C) Dataloaders support two training strategies: 1) control matching which involves mapping a control cell to a perturbed cell and 2) disentanglement which maps a perturbed cell to itself. D) A model zoo with modular components such as relevant baseline models, adversarial loss, perturbation sparsity, and others. E) Standardized benchmarking suite supporting flexible pipelines and metrics for evaluating models
Figure 2: Visualization of the ranking approach. We measure which perturbation in the data is closest to the predicted perturbation as measured by the closeness of their transcriptomes. In case A the rank metric for prediction X is $\mathrm{rank}(X) = \frac{0}{6} = 0$, in case B $\mathrm{rank}(Y) = \frac{4}{6} = 0.67$.
Figure 3: Scaling of cosine similarity (left) and its rank (right) with increasing size of data included in the training process ($x$-axis) for several perturbation response models. Points represent results on test data for 5 different seeds; the lines represent their average.
Figure C.1: Cosine similarity of log fold changes (left) and its rank (right) of the models as a function of data balance.
Figure C.2: Cosine similarity matrix based on predicted log-fold changes, between every pair of perturbation-covariate combinations in the Srivatsan20 dataset. All models are hyperparameter-optimised. A) DecoderOnly model with only covariates as input. B) DecoderOnly model with covariates and perturbations as input. C) CPA$^*$. D) CPA$^*$ (noAdv). E) SAMS-VAE$^*$ (S). F) True log-fold changes in the dataset. Diagonal blocks correspond to cell lines: A549, K562, MCF-7.
...and 1 more figures
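
The ranking approach described in the Figure 2 caption can be illustrated with a small sketch. This is not the PerturBench API: the function names `cosine_similarity_matrix` and `rank_metric` are hypothetical, cosine similarity is assumed as the closeness measure, and the rank is assumed to be normalised by the number of other perturbations (p - 1), which may differ from the paper's exact definition. Each row is taken to be an aggregated expression change (for example a log fold change vector) for one perturbation. The toy data shows why a rank metric can expose mode collapse: a model that predicts the same profile for every perturbation receives poor ranks for some perturbations.

```python
import numpy as np


def cosine_similarity_matrix(pred, obs):
    """Cosine similarity between every predicted row and every observed row."""
    pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    obs_n = obs / np.linalg.norm(obs, axis=1, keepdims=True)
    return pred_n @ obs_n.T


def rank_metric(pred, obs):
    """Per-perturbation rank in [0, 1]; 0 means the prediction is closest to its own target.

    Assumption: for prediction i, count the observed perturbations j != i that are
    strictly closer to it than its true target, then divide by (p - 1).
    """
    sim = cosine_similarity_matrix(pred, obs)
    p = sim.shape[0]
    true_sim = np.diag(sim)[:, None]         # similarity to the matching perturbation
    closer = (sim > true_sim).sum(axis=1)    # the diagonal never counts against itself
    return closer / (p - 1)


# Toy observed profiles: 4 perturbations x 3 genes (made-up numbers).
obs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])

# A perfect model reproduces each observed profile.
perfect = rank_metric(obs.copy(), obs)

# A collapsed model predicts the average profile for every perturbation.
collapsed_pred = np.tile(obs.mean(axis=0), (obs.shape[0], 1))
collapsed = rank_metric(collapsed_pred, obs)

print("perfect ranks:  ", perfect)
print("collapsed ranks:", collapsed)
```

In this toy setting the collapsed model still ranks well for the perturbation that happens to resemble the average profile, but ranks worst for the one least like it, so the mean rank rises above zero even though every prediction is a plausible "average" response.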
