Reducing the Cost of Quantum Chemical Data By Backpropagating Through Density Functional Theory

Alexander Mathiasen, Hatem Helal, Paul Balanca, Adam Krzywaniak, Ali Parviz, Frederik Hvilshøj, Blazej Banaszewski, Carlo Luschi, Andrew William Fitzgibbon

TL;DR

The paper tackles the data-labeling bottleneck in quantum chemistry posed by DFT's cubic scaling, proposing a Quantum Pre-trained Transformer (QPT) trained with an implicit DFT loss by backpropagating through the energy $E(\cdot)$. This bypasses the need for precomputed Hamiltonian labels, enabling an effectively infinite training stream and allowing scaling of molecular foundation models. The authors implement a 302M-parameter Transformer with an initial DFT guess, quantum biased attention, and density mixing, achieving comparable accuracy to prior supervised methods while reducing overall time (e.g., Uracil: 31h vs 786h total) and avoiding dataset creation costs. This approach holds promise for scalable pretraining on larger molecules, peptides, and protein–ligand systems, potentially changing how quantum-chemical data is generated and used for foundation models.

Abstract

Density Functional Theory (DFT) accurately predicts the quantum chemical properties of molecules, but scales as $O(N_{\text{electrons}}^3)$. Schütt et al. (2019) successfully approximate DFT 1000x faster with Neural Networks (NN). Arguably, the biggest problem one faces when scaling to larger molecules is the cost of DFT labels. For example, it took years to create the PCQ dataset (Nakata & Shimazaki, 2017) on which subsequent NNs are trained within a week. DFT labels molecules by minimizing energy $E(\cdot )$ as a "loss function." We bypass dataset creation by directly training NNs with $E(\cdot )$ as a loss function. For comparison, Schütt et al. (2019) spent 626 hours creating a dataset on which they trained their NN for 160h, for a total of 786h; our method achieves comparable performance within 31h.
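The idea of treating the energy as a loss, with no precomputed labels, can be illustrated with a deliberately small toy. The sketch below is not the paper's method: it assumes a fixed symmetric matrix `H` standing in for a Hamiltonian (so there is no self-consistency and no neural network), uses a hand-written projected gradient rather than automatic backpropagation, and all function names are hypothetical. It only shows that minimizing an energy directly recovers the value an eigensolver "label" would give.

```python
import numpy as np

def build_toy_hamiltonian(eigenvalues, seed=0):
    """Symmetric matrix with prescribed eigenvalues in a random orthonormal basis."""
    rng = np.random.default_rng(seed)
    n = len(eigenvalues)
    basis, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return basis @ np.diag(eigenvalues) @ basis.T

def energy(H, C):
    """Toy closed-shell energy: each occupied orbital (column of C) holds two electrons."""
    return 2.0 * np.trace(C.T @ H @ C)

def minimize_energy(H, n_occ, lr=0.02, steps=1000, seed=1):
    """Gradient descent on the energy over orthonormal orbital coefficients C."""
    rng = np.random.default_rng(seed)
    C, _ = np.linalg.qr(rng.standard_normal((H.shape[0], n_occ)))
    for _ in range(steps):
        HC = H @ C
        # Gradient of 2*tr(C^T H C), projected so the step keeps C near orthonormal.
        grad = 4.0 * (HC - C @ (C.T @ HC))
        # Take the step, then re-orthonormalize the columns with QR.
        C, _ = np.linalg.qr(C - lr * grad)
    return C

# Assumed toy spectrum; the two lowest levels are "occupied".
eigs = np.array([-3.0, -2.0, -1.0, 0.5, 1.0, 2.0])
H = build_toy_hamiltonian(eigs)
C = minimize_energy(H, n_occ=2)

e_min = energy(H, C)                                  # obtained by minimizing the loss
e_label = 2.0 * np.sort(np.linalg.eigvalsh(H))[:2].sum()  # "label" from an eigensolver
print(e_min, e_label)
```

In this toy the label is cheap, so nothing is saved; the point is only that the loss itself carries the supervision signal. In the setting the abstract describes, a network's output would play the role of `C` and the gradient would flow back into the network weights.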

Paper Structure (26 sections, 8 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: The time to label molecules with Density Functional Theory (DFT) scales as $O(N_{\text{electrons}}^3)$. The cubic scaling makes it impractical to create datasets with molecules like peptides or proteins with larger $N_{\text{electrons}}$. Our Quantum Pre-trained Transformer (QPT) bypasses the expensive DFT labeling by training with "DFT's loss function", the energy $E(\cdot)$. This circumvents the cost of creating datasets, paving a way to scale both the size of molecules and NNs.
  • Figure 2: Visualization of how a molecule gets tokenized so our Transformer can process it and output an appropriately shaped matrix for the DFT energy computation.
  • Figure 3: Time to label a training example using DFT as reported by SchNOrb, PhiSNet and QHNet, compared to a forward/backward pass of QPT using the energy $E(\cdot)$ as a loss function. Bypassing the expensive labeling gives us the ability to evaluate $E(\cdot)$ on a new $X_i$ each iteration. The triangle marks the DFT time on our newer CPUs.
  • Figure 4: Comparison of the model's performance on conformations inside the training distribution $(\phi,\psi)\in [0,180]^2$ with its performance outside the training distribution $(\phi,\psi)\in[0,360]^2\backslash[0,180]^2$. The model exhibits minor extrapolation to angles in $[180,200]$ not seen during training. A typical energy value is around $13000$ eV, so an error of $\log(|DFT_E-QPT_E|)=-4$ could mean that the neural network predicted $13000.0001$ instead of $13000.0000$. Chemical accuracy is $0.040$ eV.
  • Figure 5: On the left, we visualize $QPT_E$ during training on a single validation example, for which we also computed $DFT_E$. Next, we visualize their difference $\Delta E$. We finally present a similar plot for the resulting Hamiltonian $H$ and its molecular orbital energies $\epsilon$.
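The error scale quoted in the Figure 4 caption can be checked with a few lines of arithmetic. The sketch below assumes the logarithm is base 10 (the caption does not say so, but its 13000.0001 example matches that reading); the constant and function names are illustrative only.

```python
import math

CHEMICAL_ACCURACY_EV = 0.040  # threshold quoted in the Figure 4 caption

def abs_error_from_log10(log_err):
    """Convert a log10 absolute error back to an error in eV."""
    return 10.0 ** log_err

# The caption's example: DFT energy vs. a QPT prediction that is off by 0.0001 eV.
dft_e = 13000.0000
qpt_e = 13000.0001

log_err = math.log10(abs(dft_e - qpt_e))
err_ev = abs_error_from_log10(-4)

# How many times smaller the example error is than chemical accuracy.
margin = CHEMICAL_ACCURACY_EV / err_ev

print(f"log10 error: {log_err:.3f}")
print(f"error in eV: {err_ev:g}")
print(f"chemical accuracy / error: {margin:.0f}x")
```

Under that base-10 assumption, a plotted value of $-4$ sits a factor of about 400 below the chemical-accuracy threshold, while a value near $-1.4$ would sit right at it.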