Bayesian Flow Is All You Need to Sample Out-of-Distribution Chemical Spaces
Nianze Tao
TL;DR
This work tackles the challenge of generating out-of-distribution (OOD) molecules whose property values exceed those seen in training by using Bayesian Flow Networks (BFN), specifically the ChemBFN variant, as a natural OOD sampler. It introduces a semi-autoregressive (SAR) masking approach that enables four training/sampling strategies, together with two acceleration techniques: an online RL term and an ODE-like sampling process in latent space with temperature scaling $\tau$. The approach is demonstrated on small-molecule benchmarks (MOSES, GuacaMol, ZINC250k) and on protein sequences, showing improved validity, novelty, and high-property OOD sampling. This includes conditional generation guided by property vectors $\mathbf{y}=(\mathrm{QED},\mathrm{SA},\mathrm{DS})$, with novel hits and docking scores that are competitive with or superior to those of state-of-the-art (SOTA) models. The findings suggest that BFN-based OOD sampling, combined with SAR and guided generation, provides a practical, scalable path for de novo drug design and for exploring large chemical spaces while maintaining the validity and naturalness of the generated structures.
Abstract
Generating novel molecules with property values higher than those found in the training space, namely out-of-distribution generation, is important for de novo drug design. However, this is difficult for distribution-learning-based models such as diffusion models, because these methods are designed to fit the distribution of the training data as closely as possible. In this paper, we show that Bayesian flow networks are capable of effortlessly generating high-quality out-of-distribution samples across several scenarios. We introduce a semi-autoregressive training/sampling method that enhances model performance and surpasses state-of-the-art models.
