Count Bridges enable Modeling and Deconvolving Transcriptomic Data

Nic Fishman, Gokul Gowri, Tanush Kumar, Jiaqi Lu, Valentin de Bortoli, Jonathan S. Gootenberg, Omar Abudayyeh

TL;DR

This work introduces Count Bridges, a stochastic bridge process on the integers that provides an exact, tractable analogue of diffusion-style models for count data, with closed-form conditionals for efficient training and sampling, and extends this framework to enable direct training from aggregated measurements via an Expectation-Maximization-style approach.

Abstract

Many modern biological assays, including RNA sequencing, yield integer-valued counts that reflect the number of molecules detected. These measurements are often not at the desired resolution: while the unit of interest is typically a single cell, many measurement technologies produce counts aggregated over sets of cells. Although recent generative frameworks such as diffusion and flow matching have been extended to non-Euclidean and discrete settings, it remains unclear how best to model integer-valued data or how to systematically deconvolve aggregated observations. We introduce Count Bridges, a stochastic bridge process on the integers that provides an exact, tractable analogue of diffusion-style models for count data, with closed-form conditionals for efficient training and sampling. We extend this framework to enable direct training from aggregated measurements via an Expectation-Maximization-style approach that treats unit-level counts as latent variables. We demonstrate state-of-the-art performance on integer distribution matching benchmarks, comparing against flow matching and discrete flow matching baselines across various metrics. We then apply Count Bridges to two large-scale problems in biology: modeling single-cell gene expression data at the nucleotide resolution, with applications to deconvolving bulk RNA-seq, and resolving multicellular spatial transcriptomic spots into single-cell count profiles. Our methods offer a principled foundation for generative modeling and deconvolution of biological count data across scales and modalities.
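
The latent-variable view of aggregated counts can be illustrated with a toy model. This is an assumption made here for illustration, not the paper's method: if unit-level counts are independent Poisson variables with rates `rates[i]`, then conditioned on their observed sum they follow a multinomial distribution with probabilities proportional to those rates. The sketch below computes the posterior mean that an E-step would use and draws one set of latent unit counts consistent with the aggregate. The function names, rates, and aggregate value are hypothetical.

```python
import random

def posterior_mean_counts(aggregate, rates):
    """E[x_i | sum_i x_i = aggregate] for independent Poisson(rates[i]) units (toy assumption)."""
    total = sum(rates)
    return [aggregate * r / total for r in rates]

def sample_unit_counts(aggregate, rates, rng):
    """Draw latent unit counts from Multinomial(aggregate, rates / sum(rates))."""
    counts = [0] * len(rates)
    # Each of the `aggregate` detected molecules is assigned to one unit,
    # independently, with probability proportional to that unit's rate.
    for unit in rng.choices(range(len(rates)), weights=rates, k=aggregate):
        counts[unit] += 1
    return counts

rng = random.Random(0)
rates = [1.0, 3.0, 6.0]   # hypothetical per-cell expression rates
aggregate = 50            # hypothetical count summed over the three cells
mean_counts = posterior_mean_counts(aggregate, rates)
sampled_counts = sample_unit_counts(aggregate, rates, rng)
print(mean_counts, sampled_counts)
```

By construction, every sampled vector sums to the observed aggregate, which is the constraint any deconvolution of aggregated counts has to respect.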

Paper Structure (106 sections, 17 theorems, 141 equations, 10 figures, 14 tables, 7 algorithms)

Key Result

Proposition 2.1

Let $(X_t)_{t\in[0,1]}$ be given by equation eq:unconditional_forward_euclidean. For $0 < s < t \le 1$, consider $X_s$ conditioned on $X_t = x_t$ and $X_0 = x_0$. Then the conditional law $K_{s\mid 0,t}(\cdot\mid x_0,x_t)$ is Gaussian and can be written as [...], where $Z \sim \mathcal{N}(0,\mathrm{Id})$ is independent of $(X_0,X_t)$ and $r(s,t) = \frac{\alpha(t)^2 \sigma(s)^2}{\alpha(s)^2 \cdots}$ [...]

Figures (10)

  • Figure 1: Left: Sample paths for several endpoint gaps $d_1$ (top). With the prefix $[0,t]$ fixed, the remainder $(t,1]$ is resampled by the recursive kernel (bottom). Middle: Bessel slack posteriors at initial and intermediate times. The slack $M_t$ concentrates near $0$ as $|d|$ grows. Right: ECDFs of $X_s$ from a one-step kernel $(1\!\to\!s)$ and a two-step kernel $(1\!\to\!t\!\to\!s)$ are indistinguishable, confirming composition.
  • Figure 1: Training Poisson–BD Bridge
  • Figure 2: A scaled and rounded variant of the classic eight-Gaussians-to-two-moons task, comparing the trajectories of continuous flow matching (CFM), discrete flow matching (DFM), and Count Bridges (CB). CB achieves the lowest $W_2$, MMD, and EMD; see Table \ref{tab:discrete_moons}.
  • Figure 3: Guided Sampling for $x_0^\approx$
  • Figure 3: CFM, DFM, and CB on our low-rank mixture-of-Gaussians transport experiment across dimensions and NFE. See App. \ref{app:lowrank-gaussian} for full details.
  • ...and 5 more figures
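
The quantities named in the Figure 1 caption (endpoint gap, slack, Bessel posterior) can be made concrete with a small sketch. It assumes, as an illustration only and not as the paper's algorithm, that the process is a difference of two independent rate-`lam` Poisson processes on $[0,1]$ (births minus deaths). Under that assumption, given the endpoint gap $d = x_1 - x_0$, the number $M$ of cancelling birth-death pairs has weights proportional to $\lambda^{2m+|d|} / (m!\,(m+|d|)!)$, and given the jump totals, the jumps that occur before time $t$ are binomial. All function names and parameter values are hypothetical.

```python
import math
import random

def slack_posterior(d, lam, max_slack=200):
    """P(M = m | gap d) for m = 0..max_slack, truncated and normalised."""
    # Log weight of m cancelling birth-death pairs: (2m + |d|) log(lam) - log m! - log (m + |d|)!
    log_w = [(2 * m + abs(d)) * math.log(lam)
             - math.lgamma(m + 1) - math.lgamma(m + abs(d) + 1)
             for m in range(max_slack + 1)]
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]
    norm = sum(weights)
    return [w / norm for w in weights]

def binomial(n, p, rng):
    """Simple Binomial(n, p) draw as a sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))

def sample_bridge_state(x0, x1, t, lam, rng):
    """Sample X_t given X_0 = x0 and X_1 = x1 under the toy birth-death assumption."""
    d = x1 - x0
    probs = slack_posterior(d, lam)
    m = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    births = m + max(d, 0)    # total upward jumps on [0, 1]
    deaths = m + max(-d, 0)   # total downward jumps on [0, 1]
    # Given the totals, jump times are i.i.d. uniform, so each jump falls in [0, t] w.p. t.
    return x0 + binomial(births, t, rng) - binomial(deaths, t, rng)

rng = random.Random(0)
midpoint_states = [sample_bridge_state(3, 8, 0.5, 2.0, rng) for _ in range(5)]
print(midpoint_states)
```

In this toy version the sampler returns `x0` at `t = 0` and `x1` at `t = 1`, and the posterior mass at zero slack increases with $|d|$, matching the qualitative behaviour described in the caption.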

Theorems & Definitions (32)

  • Proposition 2.1
  • Proposition 3.1
  • Proposition 4.1: First-order aggregate projection
  • Proposition A.1: Total jumps process and time-rescaling invariance
  • proof
  • Lemma A.2: Binomial–Hypergeometric structure of Count Bridges
  • Lemma A.3: Binomial composition
  • proof
  • Lemma A.4: Hypergeometric composition
  • proof
  • ...and 22 more
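
The lemma statements are not reproduced on this page, so the following is only a guess at the kind of identity "Binomial composition" refers to: the standard thinning fact that if $J \sim \mathrm{Binomial}(n, p)$ and $K \mid J \sim \mathrm{Binomial}(J, q)$, then $K \sim \mathrm{Binomial}(n, pq)$. The sketch checks this numerically by comparing exact probability mass functions; the function names and parameter values are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_step_pmf(k, n, p, q):
    """P(K = k) when J ~ Binomial(n, p) and then K | J ~ Binomial(J, q)."""
    return sum(binom_pmf(j, n, p) * binom_pmf(k, j, q) for j in range(k, n + 1))

n, p, q = 12, 0.7, 0.4   # hypothetical values
# Largest pointwise difference between the two-step and one-step distributions.
max_gap = max(abs(two_step_pmf(k, n, p, q) - binom_pmf(k, n, p * q))
              for k in range(n + 1))
print(max_gap)
```

An identity of this kind is what allows a one-step kernel and a composition of two shorter steps to produce the same distribution, up to floating-point error in the check above.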