A Bayesian Flow Network Framework for Chemistry Tasks

Nianze Tao; Minori Abe

TL;DR

This work introduces ChemBFN, a Bayesian flow network framework for chemistry tasks that operates on discrete data representations such as SMILES/SELFIES. By adopting a novel discrete accuracy schedule with $\beta(t)$ and $\alpha(t)$, the method decouples sampling size from object length and achieves competitive generation quality with fewer steps, while enabling classifier-free guidance for conditional generation. The approach also demonstrates strong downstream predictive capability, with generative pretraining improving performance on MoleculeNet regression/classification tasks and reaction yield prediction, and shows that larger pretraining datasets do not always yield better performance. The authors release code and models publicly, highlighting the practical potential for all-in-one models in drug design, property prediction, and synthesis planning, though gaps remain compared to graph-based predictors.

Abstract

In this work, we introduce ChemBFN, a language model that handles chemistry tasks based on Bayesian flow networks working on discrete data. A new accuracy schedule is proposed to improve the sampling quality by significantly reducing the reconstruction loss. We show evidence that our method is appropriate for generating molecules with satisfactory diversity even when a smaller number of sampling steps is used. A classifier-free guidance method is adapted for conditional generation. It is also worth pointing out that after generative training, our model can be fine-tuned on regression and classification tasks with state-of-the-art performance, which opens the door to building all-in-one models in a single-module style. Our model has been open-sourced at https://github.com/Augus1999/bayesian-flow-network-for-chemistry.
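To make the classifier-free guidance idea concrete, here is a minimal sketch of the standard guidance interpolation from the classifier-free guidance literature. It blends a conditional and an unconditional prediction with a guidance strength `w`. The function names, the choice to apply the blend to per-token scores, and the example numbers are all assumptions for illustration; the abstract does not state which quantity ChemBFN applies guidance to.

```python
import math
from typing import List


def guided_scores(cond: List[float], uncond: List[float], w: float) -> List[float]:
    """Blend conditional and unconditional scores: (1 + w) * cond - w * uncond.

    w = 0 returns the conditional scores unchanged; larger w pushes the
    result further away from the unconditional prediction.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical scores over a 3-token vocabulary (illustrative values only).
    cond = [2.0, 0.5, -1.0]    # prediction given the target condition
    uncond = [1.0, 1.0, 0.0]   # prediction with the condition set to null
    probs = softmax(guided_scores(cond, uncond, w=1.5))
    print([round(p, 4) for p in probs])
```

When the conditional and unconditional predictions agree, the blend leaves them unchanged, so guidance only amplifies the differences that the condition introduces.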

Paper Structure (20 sections, 5 equations, 6 figures, 12 tables, 1 algorithm)

Introduction
Methods
Discrete Bayesian Flow Networks
Model Architecture
A New Accuracy Schedule
Datasets and Benchmarks
Fine-tuning Strategy
Experiments and Results
Unconditional Generation
Conditional Generation of Small Molecules
Molecular Scaffold Extension
Finetuning on Prediction Tasks
Reaction Yield Prediction
Is Larger Pretrain Dataset Better?
Training Details
...and 5 more sections

Figures (6)

Figure 1: Visualised scheme of our model. The architecture is inspired by DiT. The multi-head self-attention layers do not use causal masking, as in BERT, while the commonly used absolute positional embedding (used in DiT, BERT and RoBERTa models) is replaced with the novel XPOS variation of rotary positional embedding. Note that each FFN (feed-forward network) layer includes a dropout layer.
Figure 2: Comparing our accuracy schedule with the quadratic accuracy schedule initialised with the same value of $\beta(1)$. (Left) Accuracy schedules $\beta(t)$. (Right) The accuracy rates $\alpha(t)$. Note that our $\beta(t)$ does not deviate much from the quadratic one, yet the rate (derivative) differs substantially as $t$ goes to 1.
Figure 3: The fine-tuning strategy of our model. The predicted label $\hat{y}\in\mathbb{R}^{n}$ is mapped by an MLP from the embedding of the $\langle$start$\rangle$ token $\boldsymbol{\psi}'_{\langle{\rm start}\rangle}$ restricted by $t=1$. The MLP used here has 2 linear layers of sizes [512, 256, $n_{task}$] with a SELU activation function between them. Note that in prediction mode, the linear layer that maps latent vectors to output distributions is not activated; the conditioning is biased to null $\phi$; and all $\langle$pad$\rangle$ tokens are masked out in attention.
Figure 4: Visualisation of the impact on training loss, reconstruction loss $L^{r}$ and continuous (cts) time loss $L^{\infty}$ of different accuracy schedules with different values of $\beta(1)$. $L^{r}$ and $L^{\infty}$ were computed on 1k discretised steps after training.
Figure 5: Conditioned samples on QM9. The number of sampling steps was 1k. Since QM9 exhaustively included stable small molecules made up of CHONF, only 4 conditioned samples and 5 unconditioned samples are novel.
...and 1 more figures
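The relationship between an accuracy schedule $\beta(t)$ and its rate $\alpha(t)$ in the Figure 2 caption can be sketched numerically. The snippet below covers only the quadratic baseline, $\beta(t) = \beta(1)\,t^{2}$ with $\alpha(t) = \mathrm{d}\beta/\mathrm{d}t = 2\beta(1)\,t$, and checks the rate against a finite-difference estimate. The function names and the value of $\beta(1)$ are hypothetical, and the paper's own schedule is not reproduced because its formula does not appear in this overview.

```python
from typing import Callable


def quadratic_beta(t: float, beta_1: float) -> float:
    """Quadratic accuracy schedule: beta(t) = beta(1) * t**2."""
    return beta_1 * t * t


def quadratic_alpha(t: float, beta_1: float) -> float:
    """Accuracy rate of the quadratic schedule: alpha(t) = 2 * beta(1) * t."""
    return 2.0 * beta_1 * t


def numerical_rate(beta: Callable[[float], float], t: float, h: float = 1e-6) -> float:
    """Central-difference estimate of alpha(t) = d beta / dt for any schedule."""
    return (beta(t + h) - beta(t - h)) / (2.0 * h)


if __name__ == "__main__":
    beta_1 = 20.0  # hypothetical value of beta(1), chosen only for illustration
    for t in (0.25, 0.5, 1.0):
        analytic = quadratic_alpha(t, beta_1)
        estimate = numerical_rate(lambda s: quadratic_beta(s, beta_1), t)
        print(f"t={t:.2f}  beta={quadratic_beta(t, beta_1):.3f}  "
              f"alpha={analytic:.3f}  finite-diff={estimate:.3f}")
```

Passing a different schedule function to `numerical_rate` gives its rate curve in the same way. That is a convenient way to see how two schedules with similar $\beta(t)$ can still have quite different $\alpha(t)$ near $t = 1$.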
