CFM-GP: Unified Conditional Flow Matching to Learn Gene Perturbation Across Cell Types

Abrar Rahman Abir, Sajib Acharjee Dip, Liqing Zhang

TL;DR

CFM-GP is a single, scalable model for cross-cell type gene perturbation prediction. It consistently outperforms state-of-the-art baselines in R-squared and Spearman correlation, demonstrating robustness and biological fidelity.

Abstract

Understanding gene perturbation effects across diverse cellular contexts is a central challenge in functional genomics, with important implications for therapeutic discovery and precision medicine. Single-cell technologies enable high-resolution measurement of transcriptional responses, but collecting such data is costly and time-consuming, especially when repeated for each cell type. Existing computational methods often require separate models per cell type, limiting scalability and generalization. We present CFM-GP, a method for cell type-agnostic gene perturbation prediction. CFM-GP learns a continuous, time-dependent transformation between unperturbed and perturbed gene expression distributions, conditioned on cell type, allowing a single model to predict across all cell types. Unlike prior approaches that use discrete modeling, CFM-GP employs a flow matching objective to capture perturbation dynamics in a scalable manner. We evaluate on five datasets: SARS-CoV-2 infection, IFN-beta stimulated PBMCs, glioblastoma treated with Panobinostat, lupus under IFN-beta stimulation, and Statefate progenitor fate mapping. CFM-GP consistently outperforms state-of-the-art baselines in R-squared and Spearman correlation, and pathway enrichment analysis confirms recovery of key biological pathways. These results demonstrate the robustness and biological fidelity of CFM-GP as a scalable solution for cross-cell type gene perturbation prediction.
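
The training target and inference step described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes a straight-line interpolation between the control and perturbed profiles (the abstract does not specify the interpolation path), uses a forward Euler solver as one simple way to integrate the learned field from t = 0 to t = 1, and replaces the neural network with a hypothetical per-cell-type constant field (`toy_field`, `SHIFT`) so the example runs with the Python standard library alone.

```python
def cfm_training_pair(x_c, y, t):
    """Return the interpolated state x(t) and the target velocity.

    Assumption: straight-line path x(t) = (1 - t) * x_c + t * y,
    whose velocity is the constant vector y - x_c.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x_c, y)]
    u_t = [b - a for a, b in zip(x_c, y)]
    return x_t, u_t


def mse(pred, target):
    """Mean squared error between predicted and target velocity vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def integrate_euler(v, x_c, c, n_steps=100):
    """Integrate dx/dt = v(x, t, x_c, c) from t = 0 to t = 1 (forward Euler)."""
    x = list(x_c)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        dx = v(x, t, x_c, c)
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return x


# Hypothetical stand-in for a trained vector field network: each cell type
# label maps to a fixed velocity. A real model would depend on x, t and x_c.
SHIFT = {"T": [1.0, -0.5, 0.0], "B": [0.0, 2.0, 0.5]}


def toy_field(x, t, x_c, c):
    return SHIFT[c]


# One made-up control/perturbed pair over three genes.
x_c = [0.2, 1.0, 3.0]
y = [1.2, 0.5, 3.0]

# Training side: sample a time, build the interpolated state and target,
# then score the field's prediction against the target velocity.
x_t, u_t = cfm_training_pair(x_c, y, t=0.25)
loss = mse(toy_field(x_t, 0.25, x_c, "T"), u_t)

# Inference side: start at the control profile and integrate to t = 1.
y_hat = integrate_euler(toy_field, x_c, "T")
```

Because the toy field equals the true velocity for this pair, the loss is numerically zero and `y_hat` lands on `y`. With a learned network the same two functions would be called inside a training loop and at prediction time, respectively.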

Paper Structure

This paper contains 29 sections, 6 equations, 8 figures, 19 tables.

Figures (8)

  • Figure 1: Overview of the CFM-GP modeling framework. (a) During training, the model receives paired gene expression profiles of individual cells in control and perturbed conditions, along with their associated cell type labels. (b) A conditional vector field network $\mathbf{v}_\theta(\mathbf{x}(t), t \mid \mathbf{x}_c, c)$ is parameterized by a neural network, where each time step receives the interpolated state $\mathbf{x}(t)$, original control state $\mathbf{x}_c$, cell type embedding $c$, and time embedding $t$. (c) Conditional Flow Matching drives the learning of continuous trajectories between the control and perturbed states by matching the learned vector field to the ground-truth direction. (d) The model is trained to minimize the mean squared error between the predicted and true velocity vectors across random interpolations between control and perturbed states. (e) At inference, the trained vector field is integrated as an ODE from $t = 0$ to $t = 1$, starting from a new control profile $\mathbf{x}_c$ and cell type $c$, to produce a predicted perturbed profile $\hat{\mathbf{y}}$.
  • Figure 2: UMAP visualizations illustrating CFM-GP’s ability to preserve cell population structure and accurately reproduce perturbation-induced distribution shifts across multiple datasets. (a) UMAP visualization of ground-truth gene expression profiles across five datasets (COVID, PBMC, Glioblastoma, Lupus, Statefate), showing distinct cell population structures. (b) UMAP visualization of CFM-GP–predicted gene expression profiles for the same datasets, preserving the clustering patterns observed in the ground truth. (c) UMAP comparison of control vs. real perturbed cells (left) and control vs. predicted perturbed cells (right), showing that CFM-GP predictions closely match the ground-truth perturbation distribution.
  • Figure 3: Comparison of predictive performance ($R^2$) across five datasets (COVID-19, PBMC, Glioblastoma, Lupus, and Statefate) between CFM-GP and existing baseline models. Panel (a) shows absolute $R^2$ values per model and cell type. Panel (b) presents a heatmap of $R^2$ values across all datasets and cell types. Panel (c) illustrates the performance improvement of CFM-GP relative to CoupleVAE ($\Delta R^2$).
  • Figure 4: Comparison of distributional similarity between predicted and real gene expression across models using Maximum Mean Discrepancy (MMD). (a) Boxplots showing MMD values for each model across five datasets (COVID, PBMC, Glioblastoma, Lupus, Statefate); lower values indicate better alignment. (b) Heatmaps of MMD scores per model and cell type, highlighting CFM-GP’s consistent performance across diverse biological settings. (c) Bar plots showing average MMD improvement ($\Delta$MMD) of each model relative to CoupleVAE, with CFM-GP consistently achieving the largest reductions in distributional divergence.
  • Figure 5: Evaluation of gene ranking preservation using Spearman correlation. (a) Radar plots showing average Spearman correlation ($\rho$) across cell types for each model. (b) Heatmap of Spearman $\rho$ values across models and cell types, where higher values indicate better consistency in predicted gene expression rankings. (c) Mean improvement in Spearman $\rho$ of each model over the baseline (CoupleVAE), evaluated using paired t-tests across cell types.
  • ...and 3 more figures
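
The three metrics named in the captions of Figures 3 to 5 ($R^2$, MMD, and Spearman $\rho$) can be sketched as follows. The exact definitions used in the paper are not given in this summary, so the choices here are assumptions: $R^2$ is taken as the squared Pearson correlation between per-gene mean expression of real and predicted cells, Spearman $\rho$ is computed on those same per-gene means, and MMD uses a biased estimator with an RBF kernel whose bandwidth `gamma` is an arbitrary illustrative value. The data are made up.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(a, b):
    """Spearman rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(a), ranks(b))


def gene_means(cells):
    """Mean expression per gene for a list of cells (rows = cells)."""
    n = len(cells)
    return [sum(col) / n for col in zip(*cells)]


def rbf(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(x, y)))


def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between two sets of cells."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / len(X) ** 2
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / len(Y) ** 2
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy


# Made-up data: two real and two predicted perturbed cells over three genes.
real = [[1.0, 4.0, 2.0], [3.0, 6.0, 4.0]]
pred = [[1.2, 4.5, 2.9], [2.8, 6.5, 3.1]]

r2 = pearson(gene_means(real), gene_means(pred)) ** 2
rho = spearman(gene_means(real), gene_means(pred))
mmd = mmd2(real, pred)
```

Higher `r2` and `rho` indicate closer agreement in mean expression and gene ranking, while lower `mmd` indicates that the predicted and real cell distributions are closer.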