Sparse Data Diffusion for Scientific Simulations in Biology and Physics

Phil Ostheimer, Mayank Nagda, Andriy Balinskyy, Jean Radig, Carl Herrmann, Stephan Mandt, Marius Kloft, Sophie Fellenz

TL;DR

This work tackles the challenge of generating sparsely populated scientific data where exact zeros carry physical meaning. It introduces Sparse Data Diffusion (SDD), a diffusion framework that jointly diffuses discrete Sparsity Bits and continuous values to enforce exact zeros during sampling. Across calorimeter images, single-cell RNA sequencing data, and sparse images, SDD achieves higher fidelity in recovering ground-truth sparsity patterns and reproduces domain-relevant structures more faithfully than standard diffusion baselines and some domain-specific models. The method offers a scalable, physically grounded approach to generative modeling in scientific simulations and can be extended to other generative paradigms such as GANs, VAEs, and normalizing flows to broaden its impact.

Abstract

Sparse data is fundamental to scientific simulations in biology and physics, from single-cell gene expression to particle calorimetry, where exact zeros encode physical absence rather than weak signal. However, existing diffusion models lack the physical rigor to faithfully represent this sparsity. This work introduces Sparse Data Diffusion (SDD), a generative method that explicitly models exact zeros via Sparsity Bits, unifying efficient ML generation with physically grounded sparsity handling. Empirical validation in particle physics and single-cell biology demonstrates that SDD achieves higher fidelity than baseline methods in capturing sparse patterns critical for scientific analysis, advancing scalable and physically faithful simulation.
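The Sparsity Bits idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the ±1 encoding of the bits, and the zero threshold used at decoding time are all assumptions made for illustration, and the diffusion model itself is left out. The sketch only shows how a binary zero/non-zero indicator can be paired with the continuous values and then used to force exact zeros in a generated sample.

```python
import numpy as np


def encode_sparsity_bits(x):
    """Pair data with a hypothetical sparsity-bit channel.

    Assumed encoding: +1.0 where the entry is exactly zero, -1.0 otherwise.
    """
    bits = np.where(x == 0, 1.0, -1.0)
    return x, bits


def apply_sparsity_bits(values, bits_continuous):
    """Force exact zeros wherever the (possibly noisy) bit channel is positive.

    Assumed decoding rule: threshold the continuous bit channel at 0.
    """
    is_zero = bits_continuous > 0
    return np.where(is_zero, 0.0, values)


# Toy example: a sparse vector and a stand-in for a model's denoised output.
x = np.array([0.0, 1.5, 0.0, 2.0])
_, bits = encode_sparsity_bits(x)

generated_values = np.array([0.03, 1.4, -0.02, 2.1])  # small non-zero residue
generated_bits = np.array([0.8, -0.9, 0.3, -0.2])     # noisy bit channel

sample = apply_sparsity_bits(generated_values, generated_bits)
```

In this toy case the near-zero residues in `generated_values` are replaced by exact zeros because the corresponding bit entries are positive, while the non-zero entries pass through unchanged.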

Paper Structure

This paper contains 39 sections, 8 equations, 12 figures, 4 tables, 2 algorithms.

Figures (12)

  • Figure 1: Calorimeter images from the muon isolation study: signal images (left) and background images (right). Rows show samples from (1) real data, (2) SDD (ours), (3) DDIM, (4) DDIM with post-hoc thresholding to match dataset sparsity (DDIM-T), and (5) the domain-specific SARM baseline. Pixel intensity (GeV) visualizes energy deposition per cell (white=zero). SDD uniquely recovers the distinct, sparse, and clustered patterns characteristic of real data, whereas DDIM completely misses the sparsity, and DDIM-T and SARM show unrealistic isolated energy deposits.
  • Figure 2: Illustration of SDD. Unlike other diffusion models, SDD augments the continuous input with discrete Sparsity Bits during forward diffusion and uses them during backward diffusion and sampling to enforce sparsity in the data.
  • Figure 3: Sparsity distribution of real and generated data. Histograms (20 bins) and average sparsity levels (dashed lines) compare real data to samples from DDIM, DDIM-T, SARM, scDiffusion, and SDD. DDIM and scDiffusion underestimate sparsity; DDIM-T matches the average but lacks diversity and overshoots at high sparsity. SARM also underestimates sparsity, while SDD accurately matches both average value and distribution.
  • Figure 4: Average calorimeter images for Muon Signal and Muon Background. DDIM and DDIM-T fail to generate realistic data, while SDD succeeds. A linear scale is used to reveal the differences between signal and background.
  • Figure 5: Two-dimensional UMAPs of Tabula Muris and Human Lung Pulmonary Fibrosis, together with the corresponding generated samples. DDIM, DDIM-T, and the domain-specific scDiffusion show little to no overlap with the respective dataset, while SDD shows significant overlap, demonstrating that SDD more accurately captures the underlying structure and diversity of the real data distributions.
  • ...and 7 more figures
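
The captions above describe DDIM-T as DDIM with post-hoc thresholding to match dataset sparsity. The listing does not specify how the threshold is chosen, so the following is only one plausible sketch, assuming a single global cutoff picked so that the fraction of zeroed entries equals a target sparsity. The function name `threshold_to_sparsity` and the cutoff rule are hypothetical.

```python
import numpy as np


def threshold_to_sparsity(samples, target_sparsity):
    """Zero out the smallest entries so that roughly `target_sparsity`
    of all entries become exactly zero (assumed global-cutoff rule)."""
    flat = samples.ravel()
    k = int(round(target_sparsity * flat.size))  # number of entries to zero
    if k == 0:
        return samples.copy()
    # k-th smallest value serves as the cutoff; ties may zero a few extra entries.
    cutoff = np.partition(flat, k - 1)[k - 1]
    out = samples.copy()
    out[samples <= cutoff] = 0.0
    return out


# Toy example: a dense "generated" image thresholded to 50% sparsity.
dense = np.array([[0.1, 0.5], [0.9, 0.3]])
thresholded = threshold_to_sparsity(dense, target_sparsity=0.5)
```

A global cutoff of this kind matches the average sparsity by construction but applies the same rule to every sample, which is consistent with the Figure 3 observation that DDIM-T matches the average while lacking diversity.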