Sparse Data Diffusion for Scientific Simulations in Biology and Physics
Phil Ostheimer, Mayank Nagda, Andriy Balinskyy, Jean Radig, Carl Herrmann, Stephan Mandt, Marius Kloft, Sophie Fellenz
TL;DR
This work tackles the challenge of generating sparsely populated scientific data where exact zeros carry physical meaning. It introduces Sparse Data Diffusion (SDD), a diffusion framework that jointly diffuses discrete Sparsity Bits and continuous values to enforce exact zeros during sampling. Across calorimeter images, single-cell RNA sequencing data, and sparse images, SDD achieves higher fidelity in recovering ground-truth sparsity patterns and reproduces domain-relevant structures more faithfully than standard diffusion baselines and some domain-specific models. The method offers a scalable, physically grounded approach to generative modeling in scientific simulations, and it can be extended to other generative paradigms such as GANs, VAEs, and normalizing flows.
Abstract
Sparse data is fundamental to scientific simulations in biology and physics, from single-cell gene expression to particle calorimetry, where exact zeros encode physical absence rather than weak signal. However, existing diffusion models lack the physical rigor to faithfully represent this sparsity. This work introduces Sparse Data Diffusion (SDD), a generative method that explicitly models exact zeros via Sparsity Bits, unifying efficient ML generation with physically grounded sparsity handling. Empirical validation in particle physics and single-cell biology demonstrates that SDD achieves higher fidelity than baseline methods in capturing sparse patterns critical for scientific analysis, advancing scalable and physically faithful simulation.
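The idea of pairing each continuous value with a Sparsity Bit can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `encode_sparsity_bits` and `decode_sparsity_bits`, the convention that +1 marks a zero entry, the threshold at 0, and the use of small Gaussian noise as a stand-in for an imperfect model output are all assumptions made for illustration.

```python
import numpy as np

def encode_sparsity_bits(x):
    """Stack the raw values with a per-entry sparsity-bit channel.

    Assumed convention (hypothetical): +1.0 marks an exact zero,
    -1.0 marks a non-zero entry.
    """
    bits = np.where(x == 0, 1.0, -1.0)
    return np.stack([x, bits], axis=0)

def decode_sparsity_bits(z, threshold=0.0):
    """Threshold the bit channel and force flagged entries to exactly 0."""
    values, bits = z[0], z[1]
    is_zero = bits > threshold
    return np.where(is_zero, 0.0, values)

# Toy sparse sample: two exact zeros, two non-zero measurements.
x = np.array([0.0, 2.5, 0.0, 0.7])
z = encode_sparsity_bits(x)

# Stand-in for a generative model's output: the encoded sample plus
# small Gaussian noise on BOTH channels (values and bits).
rng = np.random.default_rng(0)
z_noisy = z + 0.1 * rng.standard_normal(z.shape)

# Without the bit channel, the noisy values are never exactly zero.
x_naive = z_noisy[0]

# With the bit channel, thresholding restores exact zeros.
x_hat = decode_sparsity_bits(z_noisy)
```

In this toy setting the noisy value channel alone contains no exact zeros, while thresholding the bit channel recovers the original zero pattern exactly. A real diffusion model would learn both channels jointly rather than receive a noised copy of the data.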
