QCell: Comprehensive Quantum-Mechanical Dataset Spanning Diverse Biomolecular Fragments

Adil Kabylda, Sergio Suárez-Dou, Nils Davoine, Florian N. Brünig, Alexandre Tkatchenko

TL;DR

QCell tackles the lack of comprehensive quantum-mechanical data for biomolecular spaces beyond proteins by providing 525k high-quality QM calculations for fragments across carbohydrates, nucleic acids, lipids, dimers, and ions, all at the PBE0+MBD(-NL) level of theory. The dataset implements a multi-step fragment-generation workflow that combines building blocks, conformational sampling, and pre-optimization with DFTB+MBD before high-level QM calculations, expanding coverage to 82 elements. Together with prior datasets, QCell reaches over 41 million data points, enabling training of transferable quantum-informed fragment-based force fields capable of modeling long-range interactions and solvation in biomolecules. Validation shows biologically realistic structural distributions and that the SO3LR model achieves sub-kcal/mol force errors across most biomolecular classes, demonstrating practical utility for biomolecular simulations.

Abstract

Recent advances in machine learning force fields (MLFFs) are revolutionizing molecular simulations by bridging the gap between quantum-mechanical (QM) accuracy and the computational efficiency of mechanistic potentials. However, the development of reliable MLFFs for biomolecular systems remains constrained by the scarcity of high-quality, chemically diverse QM datasets that span all of the major classes of biomolecules expressed in living cells. Crucially, such a comprehensive dataset must be computed using non-empirical or minimally empirical approximations to solving the Schrödinger equation. To address these limitations, we introduce the QCell dataset -- a curated collection of 525k new QM calculations for biomolecular fragments encompassing carbohydrates, nucleic acids, lipids, dimers, and ion clusters. QCell complements existing datasets, bringing the total number of available data points to 41 million molecular systems, all calculated using hybrid density functional theory with nonlocal many-body dispersion interactions, as captured by the PBE0+MBD(-NL) level of quantum mechanics. The QCell dataset therefore provides a valuable resource for training next-generation MLFFs capable of modeling the intricate interactions that govern biomolecular dynamics beyond small molecules and proteins.

Paper Structure

This paper contains 9 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview. A) Composition of a bacterial cell by weight, with a breakdown of the chemical constituents [alberts2022molecular]; about 40% of these compounds are not properly covered in existing datasets. B) Multi-step workflow used to construct QCell, beginning with the selection of building blocks, followed by conformational sampling and fragment selection, pre-optimization with DFTB+MBD, and finally hybrid PBE0+MBD(-NL) calculations. C) Coverage of molecular species at the PBE0+MBD(-NL) level of theory, including entries from existing databases and newly generated QCell data for nucleic acid fragments, lipids, sugars, solvated ions, and dimers.
  • Figure 2: Structural distributions across (bio)molecular datasets, with representative structures. A) Distribution of intra-strand phosphate–phosphate distances (left) and backbone bending angles (middle) in DNA trimers, compared to reference values of A-, B-, and Z-DNA [allemand1998stretched, Mitchell1998]. B) Radius of gyration distribution of fatty acid fragments with more than 300 atoms. C) Distribution of O/N-glycosidic linkage dihedrals in carbohydrates. D) Pair distance distributions for ions and water [Marcus1988, bruni2012aqueous, soper2013radial, sahle2022hydration].
  • Figure 3: Test set errors for machine learning force fields. Force mean absolute errors [kcal/mol/Å] for SO3LR models of increasing size (small, medium, large) across all the training subsets, illustrating systematic error reduction with model capacity and consistent data quality across chemically diverse systems.
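
The force mean absolute error reported in Figure 3 can be illustrated with a minimal sketch. It assumes the error is averaged over every Cartesian force component of every atom, which is a common convention but is not stated in this summary; the function name `force_mae` and the toy numbers are hypothetical and are not taken from the paper or from the SO3LR code.

```python
from typing import Sequence

Forces = Sequence[Sequence[float]]  # one [fx, fy, fz] entry per atom


def force_mae(predicted: Forces, reference: Forces) -> float:
    """Mean absolute error over all Cartesian force components.

    Assumption: component-wise averaging; units follow the inputs
    (e.g. kcal/mol/Å if both force sets are given in those units).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same number of atoms")

    total = 0.0
    count = 0
    for f_pred, f_ref in zip(predicted, reference):
        if len(f_pred) != len(f_ref):
            raise ValueError("force vectors must have matching dimensions")
        for p, r in zip(f_pred, f_ref):
            total += abs(p - r)
            count += 1

    if count == 0:
        raise ValueError("no force components supplied")
    return total / count


# Toy two-atom example with made-up values (not QCell data).
reference = [[0.0, 1.0, -2.0], [0.5, 0.0, 0.0]]
predicted = [[0.5, 1.0, -1.0], [0.5, -0.5, 0.0]]
mae = force_mae(predicted, reference)
```

Under this convention, a "sub-kcal/mol" force error means the returned value is below 1 when the forces are expressed in kcal/mol/Å.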