Flow-Based Density Ratio Estimation for Intractable Distributions with Applications in Genomics

Egor Antipov, Alessandro Palma, Lorenzo Consoli, Stephan Günnemann, Andrea Dittadi, Fabian J. Theis

TL;DR

This work leverages condition-aware flow matching to derive a single dynamical formulation for tracking density ratios along generative trajectories and shows that the method supports versatile tasks in single-cell genomics data analysis.

Abstract

Estimating density ratios between pairs of intractable data distributions is a core problem in probabilistic modeling, enabling principled comparisons of sample likelihoods under different data-generating processes across conditions and covariates. While exact-likelihood models such as normalizing flows offer a promising approach to density ratio estimation, naive flow-based evaluations are computationally expensive, as they require simulating costly likelihood integrals for each distribution separately. In this work, we leverage condition-aware flow matching to derive a single dynamical formulation for tracking density ratios along generative trajectories. We demonstrate competitive performance on simulated benchmarks for closed-form ratio estimation, and show that our method supports versatile tasks in single-cell genomics data analysis, where likelihood-based comparisons of cellular states across experimental conditions enable treatment effect estimation and batch correction evaluation.
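The contrast drawn above between two separate likelihood integrals and a single ratio ODE can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes two hypothetical 1D Gaussian paths $p_t = \mathcal{N}(0, \sigma_t^2)$ with $\sigma_t = 1 + (s-1)t$, whose velocity, divergence, and score are known in closed form, where the actual method would use learned, condition-aware networks. All function names (`sigma`, `u`, `div_u`, `score`, `rk4`, `log_lik_naive`, `log_ratio_single_ode`) are made up for illustration. The ratio dynamics used here follow from the continuity equation when the sample is simulated under the first condition.

```python
import math

def sigma(t, s):
    # Std of the toy path p_t = N(0, sigma_t^2), interpolating from 1 to s.
    return 1.0 + (s - 1.0) * t

def u(t, x, s):
    # Velocity generating that path: u_t(x) = (d sigma_t/dt / sigma_t) * x.
    return (s - 1.0) / sigma(t, s) * x

def div_u(t, s):
    # In 1D the divergence of u_t is its derivative with respect to x.
    return (s - 1.0) / sigma(t, s)

def score(t, x, s):
    # grad_x log p_t(x) for a zero-mean Gaussian with std sigma_t.
    return -x / sigma(t, s) ** 2

def rk4(f, state, t0, t1, n_steps=500):
    # Classic fixed-step Runge-Kutta 4; t1 < t0 integrates backward in time.
    h = (t1 - t0) / n_steps
    for k in range(n_steps):
        t = t0 + k * h
        k1 = f(t, state)
        k2 = f(t + h / 2, [v + h / 2 * d for v, d in zip(state, k1)])
        k3 = f(t + h / 2, [v + h / 2 * d for v, d in zip(state, k2)])
        k4 = f(t + h, [v + h * d for v, d in zip(state, k3)])
        state = [v + h / 6 * (a + 2 * b + 2 * c + d)
                 for v, a, b, c, d in zip(state, k1, k2, k3, k4)]
    return state

def log_lik_naive(x1, s):
    # Naive route: one backward solve per condition, accumulating -int div u dt.
    f = lambda t, st: [u(t, st[0], s), div_u(t, s)]
    x0, neg_int_div = rk4(f, [x1, 0.0], 1.0, 0.0)
    log_p0 = -0.5 * x0 ** 2 - 0.5 * math.log(2 * math.pi)
    return log_p0 + neg_int_div

def log_ratio_naive(x1, s, s_prime):
    # Two independent likelihood integrals, then a difference.
    return log_lik_naive(x1, s) - log_lik_naive(x1, s_prime)

def log_ratio_single_ode(x0, s, s_prime):
    # Single forward solve of (x_t, log r_t), starting from log r_0 = 0.
    def f(t, st):
        x = st[0]
        b = u(t, x, s)  # the sample follows the first condition, so b_t = u_t
        d_log_r = (div_u(t, s_prime) - div_u(t, s)
                   - (b - u(t, x, s_prime)) * score(t, x, s_prime))
        return [b, d_log_r]
    x1, log_r = rk4(f, [x0, 0.0], 0.0, 1.0)
    return x1, log_r

if __name__ == "__main__":
    x1, log_r = log_ratio_single_ode(0.7, 2.0, 0.5)
    print(x1, log_r, log_ratio_naive(x1, 2.0, 0.5))
```

In this toy setting both routes should agree with the closed-form Gaussian log-ratio up to integration error; the single-ODE route needs one solve instead of two. With learned models the score and divergence terms are approximations, so the agreement would only be approximate.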

Paper Structure

This paper contains 43 sections, 1 theorem, 49 equations, 11 figures, 8 tables, 1 algorithm.

Key Result

Proposition 4.1

Let $\bm{x}_t$ be the solution of the ODE $\mathrm{d}\bm{x}_t/\mathrm{d}t = b_t(\bm{x}_t)$ for $t\in[0,1]$. Moreover, let $\{p_t\}_{t\in[0,1]}$ and $\{p'_t\}_{t\in[0,1]}$ be two time-marginal probability paths, generated respectively by the vector fields $\{u_t\}_{t\in[0,1]}$ and $\{u'_t\}_{t\in[0,1]}$, with $\log r_0(\bm{x}_0) = 0$.
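The statement is cut off in this listing, so its conclusion is not reproduced here. As a hedged sketch of the kind of dynamics this setup leads to (not the paper's verbatim result), assume the notation $r_t = p_t / p'_t$ and write $u'_t$ for the vector field generating $p'_t$. Applying the continuity equation to each path and the chain rule along $\bm{x}_t$ gives:

```latex
\begin{align*}
\partial_t p_t &= -\nabla\cdot(p_t u_t)
  \;\Longrightarrow\;
  \partial_t \log p_t = -\nabla\cdot u_t - u_t\cdot\nabla\log p_t, \\
\frac{\mathrm{d}}{\mathrm{d}t}\log p_t(\bm{x}_t)
  &= \partial_t \log p_t(\bm{x}_t) + b_t(\bm{x}_t)\cdot\nabla\log p_t(\bm{x}_t) \\
  &= -\nabla\cdot u_t(\bm{x}_t)
     + \big(b_t - u_t\big)(\bm{x}_t)\cdot\nabla\log p_t(\bm{x}_t), \\
\frac{\mathrm{d}}{\mathrm{d}t}\log r_t(\bm{x}_t)
  &= \nabla\cdot\big(u'_t - u_t\big)(\bm{x}_t)
   + \big(b_t - u_t\big)(\bm{x}_t)\cdot\nabla\log p_t(\bm{x}_t)
   - \big(b_t - u'_t\big)(\bm{x}_t)\cdot\nabla\log p'_t(\bm{x}_t).
\end{align*}
```

The last line is the second line written for $p_t$ minus the same identity written for $p'_t$. Under the initial condition $\log r_0(\bm{x}_0) = 0$, integrating it from $t=0$ to $t=1$ along a single trajectory yields $\log r_1(\bm{x}_1)$. Choosing $b_t = u_t$ makes the middle term vanish.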

Figures (11)

  • Figure 1: Naively approximating the likelihood ratio of a data point $\bm{x}_1$ under conditions $\bm{y}$ and $\bm{y}'$ requires estimating the likelihood of $\bm{x}_1$ under two conditional models $p_t^\theta(\cdot \mid \bm{y})$ and $p_t^\theta(\cdot \mid \bm{y}')$, which implies the evaluation of two integrals. With scRatio, we directly estimate the likelihood ratio as the solution of a dedicated ODE while simulating time-marginals of a sample over time.
  • Figure 2: a. MSE between true and estimated likelihood ratios, averaged over 5 training runs, across models and simulation parameters. SB paths stands for Schrödinger Bridge probability paths. b. Runtime in seconds for likelihood ratio estimation, comparing scRatio with the naive approach across simulation parameters.
  • Figure 3: Ratio-based batch correction evaluation on the NeurIPS 2021 (panel a) and C. elegans (panel b) datasets. Left: Distribution of the absolute estimated $\log$-ratio before and after batch correction for each of the measured batch and cell type combinations in the two datasets. Right: UMAP plots calculated before and after batch correction with scVI. To compute the UMAP, we use 50-dimensional representations obtained by PCA and scVI for uncorrected and corrected data, respectively. Batch names are annotated below each plot.
  • Figure 4: On the x-axis, the mean absolute $\log$-likelihood ratio $\log r_1^\theta(\bm{x}_1 \mid (d_1, d_2), d_1)$ evaluated on cells perturbed with both $(d_1, d_2)$ and only $d_1$. On the y-axis, the mean log-odds of a classifier trained to discriminate cells treated with $(d_1, d_2)$ from cells treated with $d_1$ only. Each point is a combination. UMAP plots show extreme cases of the combinatorial effect and lack thereof.
  • Figure 5: Patient-specific differential response to perturbation. (a) Depiction of the experimental setup. Following oesinghaus2025single, we divide donors into two groups that respond differently to distinct cytokine treatments. (b) Absolute $\log$-likelihood ratios between treated and control distributions evaluated on perturbed cells from different donor groups. A higher ratio indicates a strong response to a cytokine by a donor group, while $\log$-ratios around 0 indicate a smaller shift from controls.
  • ...and 6 more figures

Theorems & Definitions (2)

  • Proposition 4.1
  • Remark 4.2