Bridging the Gap between Learning and Inference for Diffusion-Based Molecule Generation

Peidong Liu, Wenbo Zhang, Xue Zhe, Jiancheng Lv, Xianggen Liu

TL;DR

GapDiff probabilistically uses model-predicted conformations as ground truth during training. This mitigates the distributional disparity between training and inference and thereby improves the affinity of generated molecules.

Abstract

The success of diffusion models in generating diverse data modalities, including images, text, and video, has prompted their application to molecular generation, where they have yielded significant advances. However, molecule generation with diffusion models proceeds through multiple autoregressive steps over a finite time horizon, which inherently leads to exposure bias. To address this issue, we propose a training framework named GapDiff. The core idea of GapDiff is to probabilistically use model-predicted conformations as ground truth during training, which mitigates the distributional disparity between training and inference and thereby improves the affinity of generated molecules. We conduct experiments with a 3D molecular generation model on the CrossDocked2020 dataset; the Vina energy and diversity results show that our framework yields superior affinity. GapDiff is available at https://github.com/HUGHNew/gapdiff.
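
The probabilistic choice of training target described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_training_targets`, the array shapes, and the fixed probability `p_true` are assumptions made for illustration, and the schedule the paper uses for this probability is not reproduced here.

```python
import numpy as np


def select_training_targets(x_true, x_pred, p_true, rng):
    """Pick, per molecule, either the true or the model-predicted conformation.

    Hypothetical helper (not from the GapDiff code base).
    x_true, x_pred: arrays of shape (batch, atoms, 3).
    p_true: probability of keeping the original ground truth.
    rng: a numpy.random.Generator.
    """
    x_true = np.asarray(x_true, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    if x_true.shape != x_pred.shape:
        raise ValueError("x_true and x_pred must have the same shape")
    if not 0.0 <= p_true <= 1.0:
        raise ValueError("p_true must lie in [0, 1]")

    # One Bernoulli draw per molecule in the batch.
    keep_true = rng.random(x_true.shape[0]) < p_true
    # Broadcast the per-molecule decision over atoms and coordinates.
    mask = keep_true.reshape(-1, *([1] * (x_true.ndim - 1)))
    return np.where(mask, x_true, x_pred)


rng = np.random.default_rng(0)
x_true = np.zeros((4, 5, 3))  # stand-in for ground-truth conformations
x_pred = np.ones((4, 5, 3))   # stand-in for model-predicted conformations
targets = select_training_targets(x_true, x_pred, p_true=0.7, rng=rng)
```

With `p_true=1.0` the sketch reduces to ordinary training on the ground truth; lowering `p_true` exposes the model to its own predictions more often, which is the mechanism the abstract credits with narrowing the training-inference gap.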

Paper Structure

This paper contains 29 sections, 13 equations, 5 figures, 4 tables, 3 algorithms.

Figures (5)

  • Figure 1: Overview of the GapDiff pipeline with oracle conformation. Our diffusion process follows the logic of DDPM; the difference lies in the reverse process during training (the lower part, indicated by the green arrow). In the reverse process, the ground truth is selected probabilistically between the original ground truth (denoted $x_i$, e.g., $x_t$) and the model's real-time predicted value (denoted $x^{oracle}_i$, e.g., $x^{oracle}_t$), with probability $p_T$ of selecting the original value. The starting point of the reverse process is a random conformation, such as $x_T$, which corresponds to a group of atoms with bonds; it serves as the initial random noise and cannot be reconstructed into a feasible molecule. In subsequent time steps, most of the noise is removed under the protein condition $\mathcal{P}$, and changes in the molecular conformation within the protein pocket can be observed. By the final steps, the conformation is mostly stable.
  • Figure 2: The proposed sampling strategy of GapDiff. We use arrows of different colors to distinguish between the classic method and ours.
  • Figure 3: Comparison of all-atom distance distributions for reference molecules in the test set (gray) and generated molecules (color). The Jensen-Shannon divergence (JSD$\downarrow$) between the two distributions is reported.
  • Figure 4: Median Vina Dock energy for five models across 100 test targets. Each percentage is the proportion of test targets on which the model achieves the best binding affinity.
  • Figure 5: (a) Original annealing comparison. (b) Arc annealing comparison. (c) Comparison of the curves.
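
The Jensen-Shannon divergence reported in Figure 3 compares two distance distributions. The sketch below shows one common way to compute it from histograms over shared bins. The function names, the bin count, the distance range, and the base-2 logarithm are illustrative assumptions; the paper's exact binning is not given in this section.

```python
import numpy as np


def jensen_shannon_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions.

    p, q: non-negative arrays of equal length (counts or probabilities).
    With base=2 the result lies in [0, 1].
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Sum only where a > 0; there b = m > 0, so the ratio is finite.
        nz = a > 0
        return np.sum(a[nz] * np.log(a[nz] / b[nz]))

    return (0.5 * kl(p, m) + 0.5 * kl(q, m)) / np.log(base)


def distance_jsd(ref_distances, gen_distances, bins=50, value_range=(0.0, 12.0)):
    """Histogram two sets of distances on shared bins and compare them.

    bins and value_range are placeholder choices, not values from the paper.
    """
    ref_hist, _ = np.histogram(ref_distances, bins=bins, range=value_range)
    gen_hist, _ = np.histogram(gen_distances, bins=bins, range=value_range)
    return jensen_shannon_divergence(ref_hist, gen_hist)
```

A lower value means the generated molecules' distance distribution is closer to that of the reference molecules, which is why the caption marks JSD with a downward arrow. Note that SciPy's `scipy.spatial.distance.jensenshannon` returns the square root of this quantity (the Jensen-Shannon distance), so the two are not interchangeable.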