Bayesian Flow Is All You Need to Sample Out-of-Distribution Chemical Spaces

Nianze Tao

TL;DR

This work tackles the generation of out-of-distribution (OOD) molecules with high property targets by using Bayesian Flow Networks (BFN), particularly the ChemBFN variant, as a natural OOD sampler. It introduces a semi-autoregressive (SAR) training and sampling approach, whose masking enables four training/sampling strategies, together with two acceleration techniques: an online RL term and an ODE-like sampling process in latent space with temperature scaling $\tau$. The approach is evaluated on small-molecule benchmarks (MOSES, GuacaMol, ZINC250k) and on protein sequences, where it shows improved validity, novelty, and high-property OOD sampling. This includes conditional generation guided by property vectors $\mathbf{y}=(\mathrm{QED},\mathrm{SA},\mathrm{DS})$, with novel hits and docking scores competitive with or superior to those of SOTA models. The findings suggest that BFN-based OOD sampling, combined with SAR and guided generation, provides a practical, scalable path for de novo drug design and for exploring large chemical spaces while keeping the generated structures valid and natural.

Abstract

Generating novel molecules with higher property values than those in the training space, namely out-of-distribution generation, is important for de novo drug design. However, this is not easy for distribution-learning-based models, for example diffusion models, because these methods are designed to fit the distribution of the training data as closely as possible. In this paper, we show that the Bayesian flow network is capable of effortlessly generating high-quality out-of-distribution samples in several scenarios. We introduce a semi-autoregressive training/sampling method that enhances model performance and surpasses state-of-the-art models.

Paper Structure

This paper contains 18 sections, 3 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Visualisation of bidirectional, autoregressive, and semi-autoregressive token update methods.
  • Figure 2: Visualisation of MOSES benchmark metrics of different strategies. We report Valid$\times$Novel values instead of validity and novelty separately.
  • Figure 3: UMAP visualisation of the training space of ZINC250k dataset and the unconditionally generated sample spaces of different strategies.
  • Figure 4: UMAP visualisation of the training space of ZINC250k dataset and the conditionally generated sample spaces of different strategies.
  • Figure 5: FCD values of unconditional and conditional samples of different strategies.
  • ...and 4 more figures
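
The Valid$\times$Novel quantity named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the usual MOSES-style conventions (validity is the fraction of generated strings that parse; novelty is the fraction of unique valid molecules absent from the training set), which this page does not spell out. The function name `valid_times_novel` and the `is_valid` callback are hypothetical; in practice the callback might wrap RDKit's `Chem.MolFromSmiles`, and SMILES would be canonicalised before comparison.

```python
from typing import Callable, Iterable


def valid_times_novel(
    generated: Iterable[str],
    training: Iterable[str],
    is_valid: Callable[[str], bool],
) -> float:
    """Product of validity and novelty (assumed MOSES-style definitions).

    `is_valid` is a hypothetical, user-supplied validity check.
    """
    generated = list(generated)
    if not generated:
        return 0.0

    # Validity: fraction of all generated strings that pass the check.
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated)

    # Novelty: fraction of unique valid molecules not seen in training.
    unique_valid = set(valid)
    if not unique_valid:
        return 0.0
    novelty = len(unique_valid - set(training)) / len(unique_valid)

    return validity * novelty


if __name__ == "__main__":
    # Toy data: strings are compared verbatim, with no canonicalisation.
    samples = ["CCO", "CCO", "CCN", "bad", "CCC"]
    train = ["CCO"]
    print(valid_times_novel(samples, train, lambda s: s != "bad"))
```

Reporting the product penalises a model that scores well on one metric by sacrificing the other, for example by emitting many novel but invalid strings.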