QCell: Comprehensive Quantum-Mechanical Dataset Spanning Diverse Biomolecular Fragments
Adil Kabylda, Sergio Suárez-Dou, Nils Davoine, Florian N. Brünig, Alexandre Tkatchenko
TL;DR
QCell addresses the lack of comprehensive quantum-mechanical data for biomolecules beyond proteins by providing 525k high-quality QM calculations for fragments of carbohydrates, nucleic acids, lipids, dimers, and ion clusters, all computed at the PBE0+MBD(-NL) level of theory. The dataset is built with a multi-step fragment-generation workflow that combines building blocks, conformational sampling, and pre-optimization with DFTB+MBD before the high-level QM calculations, expanding coverage to 82 elements. Together with prior datasets, QCell brings the total to over 41 million data points, enabling the training of transferable, quantum-informed, fragment-based force fields capable of modeling long-range interactions and solvation in biomolecules. Validation shows that the structural distributions are biologically realistic and that the SO3LR model achieves sub-kcal/mol force errors across most biomolecular classes, demonstrating practical utility for biomolecular simulations.
Abstract
Recent advances in machine learning force fields (MLFFs) are revolutionizing molecular simulations by bridging the gap between quantum-mechanical (QM) accuracy and the computational efficiency of mechanistic potentials. However, the development of reliable MLFFs for biomolecular systems remains constrained by the scarcity of high-quality, chemically diverse QM datasets that span all of the major classes of biomolecules expressed in living cells. Crucially, such a comprehensive dataset must be computed using non-empirical or minimally empirical approximations to the solution of the Schrödinger equation. To address these limitations, we introduce the QCell dataset, a curated collection of 525k new QM calculations for biomolecular fragments encompassing carbohydrates, nucleic acids, lipids, dimers, and ion clusters. QCell complements existing datasets, bringing the total number of available molecular systems to 41 million, all calculated using hybrid density functional theory with nonlocal many-body dispersion interactions, as captured by the PBE0+MBD(-NL) level of theory. The QCell dataset therefore provides a valuable resource for training next-generation MLFFs capable of modeling the intricate interactions that govern biomolecular dynamics beyond small molecules and proteins.
