Chemistry-Enhanced Diffusion-Based Framework for Small-to-Large Molecular Conformation Generation
Yifei Zhu, Jiahui Zhang, Jiawei Peng, Mengge Li, Chao Xu, Zhenggang Lan
TL;DR
StoL is a chemistry-enhanced diffusion framework that builds 3D conformations of large molecules from SMILES in three steps: fragmentation, fragment-level diffusion generation, and chemistry-constrained assembly. Because the model learns only on small fragments, it needs no large-molecule training data. Chemical priors are embedded through a two-phase training regime (data-driven, then chemistry-enhanced) that uses Sinkhorn soft matching, Gumbel-Softmax hard matching, and planarity constraints. Results on Vorinostat and the StoL25-init dataset show broader conformational coverage and the ability to discover energetically favorable minima after DFT refinement, outperforming purely data-driven or RDKit-based approaches in diversity and quality. StoL thus offers a data-efficient, transferable, end-to-end route to scalable conformer generation that integrates physical chemistry knowledge into diffusion-based models. Data and code are public to support further research.
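For readers unfamiliar with Sinkhorn soft matching, the idea can be illustrated with a minimal, standard-library Python sketch. This is not StoL's implementation: the summary above does not say what is being matched or how costs are defined, so the cost matrix, the temperature `tau`, and the function names below are assumptions chosen purely for illustration. The sketch turns a square cost matrix into an approximately doubly stochastic soft-assignment matrix by alternating row and column normalization in log space.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(cost, tau=0.1, n_iters=50):
    """Soft assignment for a square cost matrix (hypothetical helper).

    Lower cost means a better match; tau controls how sharp the
    resulting assignment is.
    """
    n = len(cost)
    # Convert costs to log-scores: low cost -> high score.
    log_p = [[-c / tau for c in row] for row in cost]
    for _ in range(n_iters):
        # Normalize each row so that it sums to 1.
        for i in range(n):
            lse = _logsumexp(log_p[i])
            log_p[i] = [v - lse for v in log_p[i]]
        # Normalize each column so that it sums to 1.
        for j in range(n):
            lse = _logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= lse
    return [[math.exp(v) for v in row] for row in log_p]


# Toy cost matrix (made up): item 0 is closest to slot 1, item 1 to slot 0,
# item 2 to slot 2.
cost = [
    [2.0, 0.1, 3.0],
    [0.2, 2.5, 1.8],
    [3.1, 2.2, 0.05],
]
P = sinkhorn(cost)

# Reading off the largest entry in each row gives a hard assignment.
match = [max(range(len(row)), key=row.__getitem__) for row in P]
```

Because every step is a differentiable normalization, a soft matching of this kind can sit inside a training loss. A hard matching such as the Gumbel-Softmax variant named above would instead sample a near-discrete assignment; that step is not sketched here.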
Abstract
Obtaining 3D conformations of realistic polyatomic molecules at the quantum chemistry level remains challenging. Although recent machine learning advances offer promise, predicting large-molecule structures still requires substantial computational effort. Here, we introduce StoL, a diffusion model-based framework that enables rapid and knowledge-free generation of large molecular structures from small-molecule data. Remarkably, StoL assembles molecules in a LEGO-style fashion from scratch, without seeing the target molecules or any structures of comparable size during training. Given a SMILES input, it decomposes the molecule into chemically valid fragments, generates their 3D structures with a diffusion model trained on small molecules, and assembles them into diverse conformations. This fragment-based strategy eliminates the need for large-molecule training data while maintaining high scalability and transferability. By embedding chemical principles into key steps, StoL ensures faster convergence, chemically rational structures, and broad configurational coverage, as confirmed against DFT calculations.
