$\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials

Kuzma Khrabrov, Anton Ber, Artem Tsypin, Konstantin Ushenin, Egor Rumiantsev, Alexander Telepov, Dmitry Protasov, Ilya Shenbin, Anton Alekseev, Mikhail Shirokikh, Sergey Nikolenko, Elena Tutubalina, Artur Kadurin

TL;DR

The paper presents $\nabla^2$DFT, a universal quantum chemistry dataset for drug-like molecules that expands nablaDFT to over 1.9 million molecules and more than 12 million conformations, all computed at the $\omega$B97X-D/def2-SVP level. It introduces a benchmark and an extendable training framework that evaluate 10 neural-network-based models on Hamiltonian prediction, energy/force prediction, and conformational optimization across 12 splits. Key contributions include full Hamiltonian and overlap matrices, wavefunction objects, and thousands of relaxation trajectories to support conformational-optimization research. The dataset enables rigorous generalization testing (structure, scaffold, and conformation splits). The results highlight the current limits of Hamiltonian-prediction models and show that energy/force tasks improve substantially with more data, underlining the importance of large-scale, diverse data for neural potentials.

Abstract

Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called $\nabla^2$DFT that is based on nablaDFT. It contains twice as many molecular structures, three times more conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level ($\omega$B97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.

Paper Structure (13 sections, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Our comprehensive workflow for dataset and benchmark construction, as elaborated in the dataset and the setup-and-results sections of the paper. First, a diverse set of conformations is generated for molecules from the MOSES dataset. Second, Quantum Chemistry (QC) properties are computed for these conformations, accompanied by optimization trajectories. Third, the data is arranged into training and testing splits. Finally, ten state-of-the-art models are trained and evaluated on these splits.
  • Figure 2: The figure illustrates the structure of $\nabla^2$DFT, which includes 12 predefined training and test splits designed for agile experimental design. Conformational test splits contain molecules that are also in the training splits, testing the models' ability to generalize to unseen molecular geometries. In contrast, Scaffold and Structure test sets are entirely independent of the training splits, evaluating the models' ability to generalize to completely new molecules.
  • Figure 3: Performance of neural networks on the $\mathcal{D}^\text{structures}$ test split. Colours of the bars show the training splits ($\mathcal{D}^{\text{tiny}}$, $\mathcal{D}^{\text{small}}$, $\mathcal{D}^{\text{medium}}$, $\mathcal{D}^{\text{large}}$). The y-axis is log-scaled.
  • Figure 4: RMSD between optimized conformations and optimal geometry from DFT optimization.
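
The RMSD reported in Figure 4 compares a model-optimized geometry with the DFT-optimized one. The following is a minimal sketch of one common way to compute such a value: RMSD after optimal rigid alignment with the Kabsch algorithm. Whether the paper aligns structures this way, and whether hydrogens are included, is not stated in this summary, so both are assumptions here. The function name `kabsch_rmsd` and the toy coordinates are hypothetical, and atoms are assumed to be listed in the same order in both arrays.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal rigid alignment.

    Assumes row i of P and row i of Q describe the same atom.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove translation by centering both structures on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Kabsch: SVD of the 3x3 cross-covariance matrix.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    R = U @ np.diag([1.0, 1.0, d]) @ Vt  # row-vector convention: Pc @ R ~ Qc

    diff = Pc @ R - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))


if __name__ == "__main__":
    # Toy example: a rotated and shifted copy of the same geometry.
    ref = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    theta = 0.7
    rot_z = np.array([
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta), np.cos(theta), 0.0],
        [0.0, 0.0, 1.0],
    ])
    moved = ref @ rot_z.T + np.array([1.0, -2.0, 0.5])
    print(kabsch_rmsd(moved, ref))
```

Aligning before measuring means the value reflects differences in molecular shape only, not in the arbitrary position or orientation of the two geometries.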