Diffusion-Driven Domain Adaptation for Generating 3D Molecules

Haokai Hong; Wanyu Lin; Kay Chen Tan

TL;DR

This work tackles domain adaptation in 3D molecule generation by enabling diffusion-based generators to produce valid, novel molecules in unseen target domains without collecting target-domain data. The authors introduce GADM, which combines an asymmetric Equivariant Masked Autoencoder to learn structure-level domain priors with a Domain Prior-Supervised Diffusion Model that conditions denoising on these priors while preserving $SE(3)$-equivariance. The approach yields substantial improvements over state-of-the-art diffusion baselines across scaffold- and ring-structure-domain tasks, achieving high scaffold/ring coverage and meaningful generation of rare-domain molecules (e.g., up to 82.1% of an 8-ring target in QM9 and notable results in GEOM-DRUG). Overall, GADM provides a data-efficient, controllable pathway to explore novel molecular geometries, with potential impact on drug discovery and materials science.

Abstract

Can we train a molecule generator that produces 3D molecules from a new domain without collecting any data from that domain? This problem can be cast as domain-adaptive molecule generation. This work presents a novel and principled diffusion-based approach, called GADM, that shifts a generative model to desired new domains without the need to collect even a single molecule. Because domain shift is typically caused by structural variations of molecules, e.g., scaffold variations, we leverage a dedicated equivariant masked autoencoder (MAE), together with various masking strategies, to capture structural-grained representations of the in-domain varieties. In particular, with an asymmetric encoder-decoder module, the MAE can generalize to unseen structure variations from the target domains. These structure variations are encoded with an equivariant encoder and treated as domain supervisors to control denoising. We show that, with these encoded structural-grained domain supervisors, GADM can generate effective molecules within the desired new domains. We conduct extensive experiments across various domain adaptation tasks on benchmark datasets. Our approach improves the success rate, defined in terms of molecular validity, uniqueness, and novelty, by up to 65.6% compared with alternative baselines.
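The abstract describes the success rate only as being defined from validity, uniqueness, and novelty. As a minimal sketch of one plausible reading (an assumption, not the paper's stated formula), the function below counts the fraction of generated samples that are simultaneously valid, not a repeat of an earlier sample, and absent from the training set. The names `success_rate` and `is_valid`, the use of string identifiers, and the toy bracket-matching validity check are all hypothetical; a real evaluation would rely on a chemistry toolkit for validity and canonicalization.

```python
from typing import Callable, Iterable, Set


def success_rate(
    generated: Iterable[str],
    training_set: Set[str],
    is_valid: Callable[[str], bool],
) -> float:
    """Fraction of generated samples that are valid, unique, and novel.

    Assumption: each molecule is represented by a canonical string identifier,
    so equality of strings means equality of molecules.
    """
    samples = list(generated)
    if not samples:
        return 0.0

    seen: Set[str] = set()
    successes = 0
    for mol in samples:
        if not is_valid(mol):
            continue  # fails validity
        if mol in seen:
            continue  # duplicate of an earlier valid sample: fails uniqueness
        seen.add(mol)
        if mol in training_set:
            continue  # already in the training data: fails novelty
        successes += 1
    return successes / len(samples)


# Toy data with a stand-in validity check (balanced parentheses only).
samples = ["CCO", "CCO", "c1ccccc1", "C(C", "CCN"]
train = {"c1ccccc1"}
rate = success_rate(samples, train, is_valid=lambda s: s.count("(") == s.count(")"))
print(rate)
```

In this toy run, two of the five samples pass all three checks; the duplicate, the training-set molecule, and the unbalanced string are each rejected by a different criterion.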

Paper Structure (13 sections, 2 theorems, 10 equations, 3 figures, 5 tables)

This paper contains 13 sections, 2 theorems, 10 equations, 3 figures, 5 tables.

Introduction
Problem Setup and Preliminaries
Problem Definition
Preliminaries
Method
Equivariant Masked Autoencoder
Domain Prior-Supervised Diffusion Model
Training and Generation
Experiments
Experiment Setup
Results and Analysis
Related Work
Conclusion

Key Result

Theorem 3.1

$\mathcal{L}$ is an $SE(3)$-invariant variational lower bound to the log-likelihood; both the bound and the invariance hold for any geometries $\langle\mathbf{x}, \mathbf{h}\rangle$.

Figures (3)

Figure 1: Illustration of the proposed GADM framework. During training (gray pipeline): I. Equivariant Masked Autoencoder (EMAE): the equivariant encoder ($\mathcal{E}$) first maps the domain prior, i.e., the masked structure (scaffold or ring), into masked latent features. These latent features are then processed by an equivariant decoder ($\mathcal{D}$) to reconstruct the original molecule in 3D atomic space. This asymmetric encoder-decoder architecture makes it possible to capture in-domain priors and to generalize to out-of-domain structures. II. Domain Prior-Supervised Diffusion Model (DSDM): DSDM first diffuses the molecule into noise and then uses the masked latent features as the domain supervisor during denoising to reconstruct the input molecule. During generation (red pipeline): EMAE receives the target domain prior and encodes it as the domain supervisor. Then, DSDM denoises from sampled Gaussian noise under domain supervision to generate novel and valid molecules with the target structure variations.
Figure 2: Illustration of the adaptive generation process with GADM: given a scaffold from a new domain as the domain supervisor, our trained GADM can generate valid, unique, and novel molecules containing the target scaffold.
Figure 3: Scaffolds Proportion and Coverage.
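The masking step described in the Figure 1 caption, where only the scaffold or ring is passed to the encoder as the domain prior, can be pictured with a small sketch. This is an illustration under stated assumptions rather than the authors' implementation: the function name `mask_to_substructure`, the index-list interface, the made-up coordinates, and the recentring of the kept atoms on their centroid are all hypothetical choices.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def mask_to_substructure(
    coords: Sequence[Vec3],
    atom_types: Sequence[str],
    keep: Sequence[int],
) -> Tuple[List[Vec3], List[str]]:
    """Keep only the atoms listed in `keep` and recentre them on their centroid.

    Assumption: recentring is used here so that the masked structure does not
    depend on where the molecule sits in space.
    """
    if not keep:
        raise ValueError("keep must contain at least one atom index")

    kept_coords = [coords[i] for i in keep]
    kept_types = [atom_types[i] for i in keep]

    n = len(kept_coords)
    centroid = [sum(c[axis] for c in kept_coords) / n for axis in range(3)]
    centred = [
        (c[0] - centroid[0], c[1] - centroid[1], c[2] - centroid[2])
        for c in kept_coords
    ]
    return centred, kept_types


# Toy molecule: three "ring" atoms plus one substituent (made-up coordinates).
coords = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (0.7, 1.2, 0.0), (3.0, 0.5, 0.2)]
atom_types = ["C", "C", "N", "O"]
ring_atoms = [0, 1, 2]  # hypothetical indices of the structure to keep

prior_coords, prior_types = mask_to_substructure(coords, atom_types, ring_atoms)
print(prior_types)
```

Because the kept atoms are recentred, translating the whole input molecule leaves `prior_coords` unchanged, which is the kind of position independence an equivariant encoder is meant to respect.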

Theorems & Definitions (2)

Theorem 3.1
Theorem 3.2
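
Theorem 3.1 describes the training objective as $SE(3)$-invariant, meaning its value does not change when the input geometry is rotated and translated. The sketch below illustrates that property on a toy quantity only; it is not the paper's objective. The sum of squared pairwise distances, the helper names, and the example geometry are assumptions chosen because distances are unaffected by rigid motions. For brevity the rotation is about the z-axis, which is one example element of $SE(3)$ rather than a general one.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def rigid_transform(coords: Sequence[Vec3], angle: float, shift: Vec3) -> List[Vec3]:
    """Rotate every point about the z-axis by `angle` radians, then translate."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [
        (
            cos_a * x - sin_a * y + shift[0],
            sin_a * x + cos_a * y + shift[1],
            z + shift[2],
        )
        for x, y, z in coords
    ]


def pairwise_sq_distance_sum(coords: Sequence[Vec3]) -> float:
    """Toy invariant quantity: sum of squared distances over all atom pairs."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            total += sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
    return total


# Made-up geometry; the values carry no chemical meaning.
geometry = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (0.7, 1.2, 0.3), (3.0, 0.5, 0.2)]
moved = rigid_transform(geometry, angle=0.83, shift=(2.0, -1.5, 0.4))

before = pairwise_sq_distance_sum(geometry)
after = pairwise_sq_distance_sum(moved)
print(math.isclose(before, after, rel_tol=1e-9))
```

The coordinates in `moved` differ from those in `geometry`, yet the toy quantity agrees up to floating-point error, which is what invariance under a rigid motion means in practice.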
