UniGEM: A Unified Approach to Generation and Property Prediction for Molecules

Shikun Feng, Yuyan Ni, Yan Lu, Zhi-Ming Ma, Wei-Ying Ma, Yanyan Lan

TL;DR

UniGEM addresses the challenge of jointly optimizing molecular generation and property prediction by splitting the diffusion process into two phases, nucleation and growth, separated by a nucleation time $t_n$. The molecular scaffold (atomic coordinates) forms during nucleation, and atom type and property prediction are activated only in the subsequent growth phase; oversampling and a multi-branch training scheme balance the tasks. Theoretical analysis based on InfoMax and generation-error bounds supports the design, and experiments on QM9 and GEOM-Drugs show superior generation stability and competitive or superior property prediction accuracy without extra pre-training. The result is a unified, efficient framework with potential implications for AI domains beyond chemistry.

Abstract

Molecular generation and molecular property prediction are both crucial for drug discovery, but they are often developed independently. Inspired by recent studies demonstrating that diffusion models, a prominent generative approach, can learn meaningful data representations that enhance predictive tasks, we explore the potential for developing a unified generative model in the molecular domain that effectively addresses both molecular generation and property prediction tasks. However, the integration of these tasks is challenging due to inherent inconsistencies, making simple multi-task learning ineffective. To address this, we propose UniGEM, the first unified model to successfully integrate molecular generation and property prediction, delivering superior performance in both tasks. Our key innovation lies in a novel two-phase generative process, where predictive tasks are activated in the later stages, after the molecular scaffold is formed. We further enhance task balance through innovative training strategies. Rigorous theoretical analysis and comprehensive experiments demonstrate our significant improvements in both tasks. The principles behind UniGEM hold promise for broader applications, including natural language processing and computer vision.
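
The two-phase idea can be sketched as a timestep-gated training loss. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the constants `T` and `T_NUCLEATION`, the function names, the 50% oversampling rate, and the convention that `t = T` is pure noise (so the growth phase is the range `t <= T_NUCLEATION`) are all hypothetical. It also assumes the coordinate denoising loss is active at every timestep.

```python
import random

T = 1000           # total number of diffusion steps (assumed value)
T_NUCLEATION = 10  # hypothetical nucleation time t_n, chosen for illustration


def sample_timestep(rng, growth_prob=0.5):
    """Sample a training timestep while oversampling the growth phase.

    Assumed convention: t = T is pure noise and t near 0 is data, so the
    growth phase is 1 <= t <= T_NUCLEATION and nucleation is t > T_NUCLEATION.
    Uniform sampling would visit the growth phase only T_NUCLEATION / T of
    the time; here it is visited with probability growth_prob instead.
    """
    if rng.random() < growth_prob:
        return rng.randint(1, T_NUCLEATION)      # growth phase (inclusive bounds)
    return rng.randint(T_NUCLEATION + 1, T)      # nucleation phase


def total_loss(t, coord_loss, type_loss, prop_loss):
    """Combine per-task losses, gating the predictive terms by phase.

    The coordinate loss is always included (an assumption of this sketch);
    atom-type and property losses are added only in the growth phase.
    """
    if t <= T_NUCLEATION:
        return coord_loss + type_loss + prop_loss
    return coord_loss


if __name__ == "__main__":
    rng = random.Random(0)
    t = sample_timestep(rng)
    print(t, total_loss(t, coord_loss=1.0, type_loss=2.0, prop_loss=3.0))
```

Gating by timestep means the predictive heads never see inputs from the nucleation phase, and oversampling compensates for the growth phase covering only a small fraction of the timestep range.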

Paper Structure

This paper contains 38 sections, 3 theorems, 29 equations, 6 figures, 10 tables.

Key Result

Theorem 2.1

The mutual information between ${\bm{x}}_0$ and $\zeta_t$ can be expressed as follows, with a subsequent lower bound: where $q({\bm{x}}_0,{\bm{x}}_t)$ is the data distribution defined by the forward process of diffusion, and $q(\zeta_t|{\bm{x}}_t)=\delta_{g_\theta({\bm{x}}_t)}$ and $p({\bm{x}}_0|\zeta_t)$ are the representation and denoising distributions estimated by the denoising network. In practice, our den...
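
For orientation, a generic variational (Barber-Agakov style) lower bound on mutual information, written with the symbols defined above, has the following shape. This is a sketch of the standard InfoMax argument, assuming $\zeta_t$ depends on ${\bm{x}}_0$ only through ${\bm{x}}_t$; it is not a verbatim reproduction of the theorem's statement.

```latex
\begin{aligned}
I({\bm{x}}_0;\zeta_t)
  &= H({\bm{x}}_0) - H({\bm{x}}_0 \mid \zeta_t) \\
  &= H({\bm{x}}_0)
     + \mathbb{E}_{q({\bm{x}}_0,{\bm{x}}_t)}\,
       \mathbb{E}_{q(\zeta_t \mid {\bm{x}}_t)}
       \bigl[\log p({\bm{x}}_0 \mid \zeta_t)\bigr]
     + \mathbb{E}_{q(\zeta_t)}\,
       D_{\mathrm{KL}}\bigl(q({\bm{x}}_0 \mid \zeta_t)\,\|\,p({\bm{x}}_0 \mid \zeta_t)\bigr) \\
  &\ge H({\bm{x}}_0)
     + \mathbb{E}_{q({\bm{x}}_0,{\bm{x}}_t)}\,
       \mathbb{E}_{q(\zeta_t \mid {\bm{x}}_t)}
       \bigl[\log p({\bm{x}}_0 \mid \zeta_t)\bigr]
\end{aligned}
```

The inequality follows because the KL term is non-negative, so maximizing the expected log-likelihood of ${\bm{x}}_0$ given the representation $\zeta_t$ raises a lower bound on the mutual information.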

Figures (6)

  • Figure 1: The two-phase generative process of UniGEM. We treat molecule generation as a two-phase problem: nucleation and growth, defining the separation time as nucleation time. Properties are only well-defined in the growth phase, so during training, property and atom type predictions are incorporated only in the growth phase.
  • Figure 2: Comparison of performance across different nucleation time selections for both the generation and property prediction tasks.
  • Figure 3: The training process of UniGEM.
  • Figure 4: The molecular generative process of UniGEM.
  • Figure 5: A visualization of the generation process of UniGEM.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Theorem 2.1
  • Theorem 2.2: Generative Error Analysis
  • Theorem G.1: Theorem 2 in chen2023sampling
  • proof
  • proof