Data-Driven Parametrization of Molecular Mechanics Force Fields for Expansive Chemical Space Coverage

Tianze Zheng, Ailun Wang, Xu Han, Yu Xia, Xingyuan Xu, Jiawei Zhan, Yu Liu, Yang Chen, Zhi Wang, Xiaojie Wu, Sheng Gong, Wen Yan

TL;DR

To address the challenge of achieving accurate molecular mechanics energy predictions across expansive drug-like chemical space, the authors introduce ByteFF, an Amber-compatible, data-driven MM force field trained on a large quantum-mechanics dataset. ByteFF leverages an edge-augmented, symmetry-preserving graph transformer to predict all MM parameters in one pass, trained with a three-stage strategy that includes a differentiable partial Hessian loss and ensemble averaging for uncertainty. The dataset comprises 2.4 million optimized fragments with Hessians and 3.2 million torsion profiles generated at the B3LYP-D3(BJ)/DZVP level, enabling broad chemical-space coverage. Benchmark results show ByteFF achieving state-of-the-art performance for relaxed geometries, torsional energy profiles, and conformational energies/forces, promising improved reliability for drug discovery MD simulations.

Abstract

A force field is a critical component in molecular dynamics simulations for computational drug discovery. It must achieve high accuracy within the constraints of molecular mechanics' (MM) limited functional forms, which offer high computational efficiency. With the rapid expansion of synthetically accessible chemical space, traditional look-up table approaches face significant challenges. In this study, we address this issue using a modern data-driven approach, developing ByteFF, an Amber-compatible force field for drug-like molecules. To create ByteFF, we generated an expansive and highly diverse molecular dataset at the B3LYP-D3(BJ)/DZVP level of theory. This dataset includes 2.4 million optimized molecular fragment geometries with analytical Hessian matrices, along with 3.2 million torsion profiles. We then trained an edge-augmented, symmetry-preserving molecular graph neural network (GNN) on this dataset, employing a carefully optimized training strategy. Our model predicts all bonded and non-bonded MM force field parameters for drug-like molecules simultaneously across a broad chemical space. ByteFF demonstrates state-of-the-art performance on various benchmark datasets, excelling in predicting relaxed geometries, torsional energy profiles, and conformational energies and forces. Its exceptional accuracy and expansive chemical space coverage make ByteFF a valuable tool for multiple stages of computational drug discovery.
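
For orientation, the "limited functional forms" mentioned in the abstract can be illustrated with the Amber-style energy expression as it is commonly written. This is a general sketch, not taken from this section: the exact terms ByteFF uses (for example, how improper torsions are treated) are not stated here, and the symbols below are the conventional ones rather than the authors' notation.

```latex
$$
E_{\mathrm{MM}} =
\sum_{\mathrm{bonds}} k_r \,(r - r_0)^2
+ \sum_{\mathrm{angles}} k_\theta \,(\theta - \theta_0)^2
+ \sum_{\mathrm{torsions}} \sum_{n} \frac{V_n}{2}\,\bigl[1 + \cos(n\phi - \gamma_n)\bigr]
+ \sum_{i<j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}
+ \frac{q_i q_j}{4\pi\varepsilon_0\, r_{ij}} \right]
$$
```

In this form, the bonded parameters are $k_r, r_0, k_\theta, \theta_0, V_n, \gamma_n$ and the non-bonded parameters are the Lennard-Jones coefficients $A_{ij}, B_{ij}$ and the partial charges $q_i$. These are the kinds of quantities a look-up table assigns by atom type and that a model like the one described would instead predict from the molecular graph.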

Paper Structure (16 sections, 2 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Model structure of ByteFF. ByteFF predicts MMFF parameters in three steps. First, atom and bond features are extracted from the molecular graph and projected into embeddings. Then, an edge-augmented graph transformer (EGT; Hussain et al., 2022) is used to combine the edge embeddings with the node-based attention mechanism. Lastly, the output module derives force field parameters while preserving the molecular symmetry and total charge.
  • Figure 2: t-SNE analysis of different datasets. The Morgan-based torsional fingerprints are visualized using the t-SNE algorithm. Each point corresponds to a torsion profile in the corresponding dataset and is colored by the element types of the two central atoms of the torsion.
  • Figure 3: Histograms of the discrepancy in torsional PES between QM and force field predictions. As a comprehensive benchmark, two metrics, Boltzmann RMSE (a-c) and RMSE (d-f), are used to assess the accuracy of force field-predicted torsional energy profiles with respect to the QM results. Three datasets were included in this benchmark: TorsionNet500 (a & d), BDTorsion-NonRing (b & e), and BDTorsion-InRing (c & f).
  • Figure 4: Example of the in-ring and non-ring torsion prediction accuracy of various force fields. As examples to show the accuracy of ByteFF models in predicting torsional energy profiles, an in-ring (a-b) and a non-ring (c-d) example molecule are provided. The torsional energy profiles predicted by various force fields are compared with the QM references and shown for each example molecule.
  • Figure 5: Histograms of different metrics on the OpenFFBenchmark dataset. The accuracy of force field-relaxed geometry relative to QM-relaxed references is quantified with (a) RMSD and (b) Torsion Fingerprint Deviation (TFD) scores. The energetic accuracy is quantified by (c) $\Delta\Delta E$ distributions. All benchmark results for "OPLS4 cst" and the detailed protocols used to calculate the benchmark results are taken from D'Amore et al. (2022).
  • ...and 1 more figure
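
The two torsion metrics named in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' benchmark code: the function names are hypothetical, each profile is shifted so that its own minimum is zero before comparison, the Boltzmann weights are taken as exp(-E_QM / kT) on the QM profile, and kT is set to roughly 0.593 kcal/mol (about 298 K). The paper's exact definitions may differ.

```python
import numpy as np

# Assumed thermal energy k_B * T near 298 K, in kcal/mol.
KT_KCAL_PER_MOL = 0.593


def _shift_to_min(energies):
    """Return the profile as a float array with its minimum set to zero."""
    e = np.asarray(energies, dtype=float)
    return e - e.min()


def torsion_rmse(e_qm, e_ff):
    """Unweighted RMSE between two torsion profiles (same scan angles)."""
    diff = _shift_to_min(e_ff) - _shift_to_min(e_qm)
    return float(np.sqrt(np.mean(diff ** 2)))


def torsion_boltzmann_rmse(e_qm, e_ff, kt=KT_KCAL_PER_MOL):
    """RMSE weighted by Boltzmann factors of the QM profile (assumed form).

    Low-energy scan points, which are thermally accessible, dominate the
    error; high-energy barriers contribute very little.
    """
    e_qm0 = _shift_to_min(e_qm)
    diff = _shift_to_min(e_ff) - e_qm0
    weights = np.exp(-e_qm0 / kt)
    return float(np.sqrt(np.sum(weights * diff ** 2) / np.sum(weights)))


if __name__ == "__main__":
    # Toy profiles in kcal/mol: the force field is wrong only at the barrier.
    qm = [0.0, 1.0, 5.0, 1.0]
    ff = [0.0, 1.0, 7.0, 1.0]
    print("RMSE:", torsion_rmse(qm, ff))
    print("Boltzmann RMSE:", torsion_boltzmann_rmse(qm, ff))
```

With these toy profiles the unweighted RMSE penalizes the barrier error fully, while the Boltzmann-weighted variant largely discounts it, which is why reporting both gives complementary views of torsion accuracy.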