Reducing the Cost of Quantum Chemical Data By Backpropagating Through Density Functional Theory
Alexander Mathiasen, Hatem Helal, Paul Balanca, Adam Krzywaniak, Ali Parviz, Frederik Hvilshøj, Blazej Banaszewski, Carlo Luschi, Andrew William Fitzgibbon
TL;DR
The paper tackles the data-labeling bottleneck in quantum chemistry caused by DFT's cubic scaling. It proposes a Quantum Pre-trained Transformer (QPT) trained with an implicit DFT loss, obtained by backpropagating through the energy $E(\cdot)$. This removes the need for precomputed Hamiltonian labels, enabling an effectively infinite stream of training data and supporting the scaling of molecular foundation models. The authors implement a 302M-parameter Transformer with an initial DFT guess, quantum biased attention, and density mixing. It reaches accuracy comparable to prior supervised methods while reducing total time (e.g., Uracil: 31h vs 786h) and avoiding the cost of dataset creation. The approach is promising for scalable pretraining on larger molecules, peptides, and protein–ligand systems, and could change how quantum-chemical data is generated and used for foundation models.
Abstract
Density Functional Theory (DFT) accurately predicts the quantum chemical properties of molecules, but scales as $O(N_{\text{electrons}}^3)$. Schütt et al. (2019) successfully approximate DFT 1000x faster with neural networks (NNs). Arguably, the biggest problem one faces when scaling to larger molecules is the cost of DFT labels. For example, it took years to create the PCQ dataset (Nakata & Shimazaki, 2017), on which subsequent NNs are trained within a week. DFT labels molecules by minimizing the energy $E(\cdot)$ as a "loss function." We bypass dataset creation by directly training NNs with $E(\cdot)$ as a loss function. For comparison, Schütt et al. (2019) spent 626h creating a dataset on which they trained their NN for 160h, for a total of 786h; our method achieves comparable performance within 31h.
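The idea of using the energy itself as a loss, rather than fitting precomputed labels, can be illustrated with a minimal sketch. This is a toy under stated assumptions and not the authors' method: the Hamiltonian is a fixed random symmetric matrix (real DFT depends self-consistently on the density), a free vector `c` stands in for the output of a network, and the gradient is written by hand instead of being obtained by backpropagation. The names `toy_hamiltonian`, `energy`, and `minimize_energy` are hypothetical.

```python
import numpy as np


def toy_hamiltonian(n=6, seed=0):
    """Hypothetical stand-in for a Hamiltonian: symmetric, with a known spectrum."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))  # random orthogonal basis
    levels = np.linspace(-1.0, 1.0, n)            # energy levels, lowest is -1
    return q @ np.diag(levels) @ q.T


def energy(h, c):
    """Rayleigh quotient E(c) = c^T H c / c^T c, used here as the loss."""
    return float(c @ h @ c) / float(c @ c)


def minimize_energy(h, steps=500, lr=0.2, seed=1):
    """Gradient descent on E(c); no eigenvector label is ever computed."""
    rng = np.random.default_rng(seed)
    c = rng.normal(size=h.shape[0])
    c /= np.linalg.norm(c)
    for _ in range(steps):
        e = energy(h, c)
        grad = 2.0 * (h @ c - e * c)  # gradient of E with respect to c at unit norm
        c = c - lr * grad
        c /= np.linalg.norm(c)        # keep the toy orbital normalized
    return energy(h, c), c


h = toy_hamiltonian()
e_min, c_min = minimize_energy(h)
# Reference value that a supervised pipeline would precompute as a label.
e_label = np.linalg.eigvalsh(h)[0]
print(f"energy-as-loss: {e_min:.6f}   diagonalization label: {e_label:.6f}")
```

In this toy, the minimized energy matches the lowest eigenvalue that explicit diagonalization would supply as a label, which is the sense in which the loss makes the label unnecessary. In the paper's setting, the free vector would be replaced by network parameters and the gradient would flow through the DFT energy.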
