Chemistry-Enhanced Diffusion-Based Framework for Small-to-Large Molecular Conformation Generation

Yifei Zhu, Jiahui Zhang, Jiawei Peng, Mengge Li, Chao Xu, Zhenggang Lan

TL;DR

StoL introduces a chemistry-enhanced diffusion framework that builds large-molecule 3D conformations from SMILES via fragmentation, fragment-level diffusion generation, and chemistry-constrained assembly. The method avoids requiring large-molecule training data by learning on small fragments and embedding chemical priors through a two-phase training regime (data-driven followed by chemistry-enhanced) that uses Sinkhorn soft matching, Gumbel-Softmax hard matching, and planarity constraints. Empirical results on Vorinostat and the StoL25-init dataset show broader conformational coverage and the ability to discover energetically favorable minima after DFT refinement, outperforming purely data-driven or RDKit-based approaches in diversity and quality. By enabling end-to-end generation with data efficiency and transferability, StoL demonstrates a practical pathway for scalable conformer generation that integrates physical chemistry knowledge into diffusion-based models, with public data and code to support further research.

Abstract

Obtaining 3D conformations of realistic polyatomic molecules at the quantum chemistry level remains challenging, and although recent machine learning advances offer promise, predicting large-molecule structures still requires substantial computational effort. Here, we introduce StoL, a diffusion model-based framework that enables rapid and knowledge-free generation of large molecular structures from small-molecule data. Remarkably, StoL assembles molecules in a LEGO-style fashion from scratch, without seeing the target molecules or any structures of comparable size during training. Given a SMILES input, it decomposes the molecule into chemically valid fragments, generates their 3D structures with a diffusion model trained on small molecules, and assembles them into diverse conformations. This fragment-based strategy eliminates the need for large-molecule training data while maintaining high scalability and transferability. By embedding chemical principles into key steps, StoL ensures faster convergence, chemically rational structures, and broad configurational coverage, as confirmed against DFT calculations.

Paper Structure

This paper contains 24 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: (a) Input SMILES of the target molecule. (b) Fragmentation into smaller components. (c) Diffusion-based generation using denoising models. (d) Fragment assembly to reconstruct the target molecule, followed by a chemoinformatics check. All molecular geometries were visualized via PyMOL.
  • Figure 2: (a) Schematic illustration of the standard diffusion model. (b) Diagram of the chemistry-enhanced training strategy. (c) Basic validation procedures for fragment generation. (d) Chemoinformatics-based validation of assembled molecular structures.
  • Figure 3: Illustrative example of the StoL process applied to vorinostat. (a) SMILES representation and molecular structure of vorinostat. (b) Fragmentation of the molecule with corresponding fragment SMILES. (c) For fragment 1, dimensionality reduction followed by clustering yields 10 representative conformations. (d) Final assembly of vorinostat, showing six representative 3D structures that passed the cheminformatics validation check. All molecular geometries were visualized via PyMOL.
  • Figure 4: DFT evaluation of vorinostat conformations. (a) 29 representative conformations derived from DFT calculations using StoL-generated initial guesses, arranged in descending order of energy; structures highlighted in light green were obtained from both RDKit and StoL initial guesses. (b) Box plot of relative energies ($\Delta E$) referenced to the lowest-energy structure, comparing conformers generated from RDKit+DFT (StoL25-init dataset) and StoL+DFT (DFT calculation based on StoL-generated initial structures). (c) Comparison of the lowest-energy conformers obtained by the two methods, showing that the StoL+DFT conformer is more stable, with an energy 1.88 kcal/mol lower than the RDKit+DFT conformer. All molecular geometries were visualized via PyMOL.
  • Figure 5: (a) Conformer distribution in the StoL25-init dataset. Each structure was obtained by DFT optimization of 10 RDKit-generated initial guesses. (b) Training loss curves comparing the P-CE-StoL and non-CE-StoL models. Red lines denote Train Loss (P-CE-StoL) and Validation Loss (P-CE-StoL), while blue lines denote Train Loss (non-CE-StoL) and Validation Loss (non-CE-StoL), plotted across training iterations. (c) Cumulative distribution of relative planarity improvement, with the x-axis showing the percentage increase of P-CE-StoL over CE-StoL and the y-axis showing the fraction of molecules with improvement less than or equal to that percentage. The yellow-shaded region highlights molecules with improvement greater than 50%. (d) Violin plots of BRMSD values for generated structures with respect to reference conformations in the StoL25-init dataset, with the red dashed line marking 1 Å and the blue dashed line marking 2 Å.
  • ...and 1 more figure
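Two quantities in the captions above are simple to compute by hand: the relative energy $\Delta E$ referenced to the lowest-energy structure (Figure 4b), and the fraction of values at or below a threshold, which is one point of a cumulative distribution (Figure 5c) or the share of structures under a 1 Å or 2 Å cutoff (Figure 5d). The following is a minimal sketch under stated assumptions: it assumes input energies are in Hartree, the function names are hypothetical, and the numbers are made-up placeholders rather than values from the paper.

```python
# Standard conversion factor from Hartree to kcal/mol.
HARTREE_TO_KCAL_PER_MOL = 627.5095


def relative_energies(energies_hartree):
    """Return Delta E in kcal/mol relative to the lowest-energy conformer.

    Assumes the input energies are in Hartree (an assumption of this sketch).
    """
    e_min = min(energies_hartree)
    return [(e - e_min) * HARTREE_TO_KCAL_PER_MOL for e in energies_hartree]


def fraction_within(values, threshold):
    """Fraction of values less than or equal to threshold (one ECDF point)."""
    if not values:
        raise ValueError("values must be non-empty")
    return sum(v <= threshold for v in values) / len(values)


# Placeholder conformer energies (Hartree) and RMSD-like values (Angstrom);
# these are illustrative numbers, not data from the paper.
placeholder_energies = [-1.000, -1.001, -0.999]
placeholder_rmsd = [0.5, 1.5, 2.5, 0.9]

delta_e = relative_energies(placeholder_energies)
share_below_1A = fraction_within(placeholder_rmsd, 1.0)
share_below_2A = fraction_within(placeholder_rmsd, 2.0)
```

Evaluating `fraction_within` over a sorted grid of thresholds gives the full cumulative curve; the lowest-energy conformer always has a relative energy of zero by construction.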