Three Forms of Stochastic Injection for Improved Distribution-to-Distribution Generative Modeling
Shiye Su, Yuhui Zhang, Linqi Zhou, Rajesh Ranganath, Serena Yeung-Levy
TL;DR
The paper tackles distribution-to-distribution learning in which both the source and target distributions must be learned from unpaired data, and identifies data sparsity as a core bottleneck for standard flow matching. To densify supervision, it introduces three forms of stochastic injection (two-stage transfer learning, perturbing the source with Gaussian noise, and perturbing the interpolant with stochastic noise), along with stochastic interpolants and a VAE-based latent model. Across five high-dimensional imaging datasets, the method improves FID by an average of $13$ points over deterministic flow matching and $9$ points over existing baselines, while also reducing transport costs and improving source-target alignment. This approach makes flow matching a practical and scalable tool for simulating scientifically meaningful distribution transformations in biology, medicine, astronomy, and beyond.
Abstract
Modeling transformations between arbitrary data distributions is a fundamental scientific challenge, arising in applications like drug discovery and evolutionary simulation. While flow matching offers a natural framework for this task, its use has thus far focused primarily on the noise-to-data setting, and its application in the general distribution-to-distribution setting remains underexplored. We find that in the latter case, where the source is also a data distribution to be learned from limited samples, standard flow matching fails due to sparse supervision. To address this, we propose a simple and computationally efficient method that injects stochasticity into the training process by perturbing source samples and flow interpolants. On five diverse imaging tasks spanning biology, radiology, and astronomy, our method significantly improves generation quality, outperforming existing baselines by an average of 9 FID points. Our approach also reduces the transport cost between input and generated samples to better highlight the true effect of the transformation, making flow matching a more practical tool for simulating the diverse distribution transformations that arise in science.
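
To make the two perturbations named in the abstract concrete, the following is a minimal sketch of how one training batch might be built, not the authors' implementation. It assumes a linear interpolant, a constant-variance Gaussian perturbation of the interpolant (under which the regression target remains the straight-line velocity), and illustrative noise scales; the function name `make_training_pair` and the parameters `source_sigma` and `interp_sigma` are hypothetical.

```python
import numpy as np

def make_training_pair(x0, x1, rng, source_sigma=0.1, interp_sigma=0.05):
    """Build (x_t, t, target_velocity) for one flow matching batch.

    x0: batch of source samples, x1: batch of (unpaired) target samples,
    both of identical shape (batch, ...). Noise scales are illustrative
    assumptions, not values taken from the paper.
    """
    # Injection on the source: add Gaussian noise so the model sees a
    # denser neighborhood around each of the limited source samples.
    x0_noisy = x0 + source_sigma * rng.standard_normal(x0.shape)

    # One time value per example, shaped to broadcast over data dimensions.
    t = rng.uniform(size=(x0.shape[0],) + (1,) * (x0.ndim - 1))

    # Injection on the interpolant: linear path plus constant-variance
    # Gaussian noise (an assumed form of the stochastic perturbation).
    x_t = (1.0 - t) * x0_noisy + t * x1
    x_t = x_t + interp_sigma * rng.standard_normal(x0.shape)

    # With constant-variance noise around a linear mean path, the
    # conditional velocity target is the straight-line displacement.
    target_velocity = x1 - x0_noisy
    return x_t, t, target_velocity
```

A velocity network would then be trained by regressing its output at `(x_t, t)` onto `target_velocity` with a mean-squared-error loss. Setting both noise scales to zero recovers the deterministic flow matching pair.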
