Multi-task learning for molecular electronic structure approaching coupled-cluster accuracy

Hao Tang; Brian Xiao; Wenhao He; Pero Subasic; Avetik R. Harutyunyan; Yao Wang; Fang Liu; Haowei Xu; Ju Li

TL;DR

The paper introduces a multi-task equivariant graph neural network (EGNN) that predicts CCSD(T)-level electronic properties by augmenting a local DFT starting Hamiltonian ${\mathbf F}'$ with a learned correction ${\mathbf V}^{\theta}$ to form the effective Hamiltonian ${\mathbf H}^{\rm eff}$. Trained on CCSD(T) data for hydrocarbons, the model predicts energies, dipoles, quadrupoles, atomic charges, bond orders, and excited-state properties such as the energy gap $E_g$ and polarizability $\alpha$. Perturbation-theory-based back-propagation keeps gradients stable when differentiating through the electronic eigenproblem. The approach is more accurate and more data-efficient than the compared methods on both in-domain and out-of-domain molecules, including aromatic systems and large semiconducting polymers, and it is substantially faster than CCSD(T) and widely used DFT functionals. Because the effective Hamiltonian correction is combined with multi-task learning, the framework offers a scalable route to CCSD(T)-level predictions at near-linear scaling for materials design and molecular engineering, and it can be extended to broader multi-element datasets.
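To make the effective-Hamiltonian idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes an additive form ${\mathbf H}^{\rm eff} = {\mathbf F}' + {\mathbf V}^{\theta}$ in an orthonormal basis (no overlap matrix), uses small hand-made matrices in place of real DFT output and network predictions, and shows only the standard first-order perturbation-theory result that an orbital energy changes by $\mathbf c_i^{T}\,\delta{\mathbf H}\,\mathbf c_i$. The function names `effective_hamiltonian` and `gap_and_gradient` are hypothetical.

```python
import numpy as np


def effective_hamiltonian(f_start, v_corr):
    """Assumed additive form H_eff = F' + V_theta, symmetrized for safety."""
    h = f_start + v_corr
    return 0.5 * (h + h.T)


def gap_and_gradient(h_eff, n_occ):
    """Return the HOMO-LUMO gap and its derivative with respect to H_eff.

    First-order perturbation theory gives d(eps_i) = c_i^T dH c_i, so the
    gradient of the gap is the difference of two orbital outer products.
    """
    eps, c = np.linalg.eigh(h_eff)  # eigenvalues in ascending order
    homo, lumo = n_occ - 1, n_occ
    gap = eps[lumo] - eps[homo]
    grad = np.outer(c[:, lumo], c[:, lumo]) - np.outer(c[:, homo], c[:, homo])
    return gap, grad


# Toy stand-ins (illustrative values only, not data from the paper).
n, n_occ = 6, 3
f_start = np.diag([-1.0, -0.6, -0.3, 0.2, 0.5, 0.9])
v_corr = 0.02 * (np.ones((n, n)) - np.eye(n))  # stands in for the learned V_theta

h_eff = effective_hamiltonian(f_start, v_corr)
gap, grad = gap_and_gradient(h_eff, n_occ)

# Check the analytic gradient against a central finite difference
# along an arbitrary symmetric direction d.
idx = np.arange(n)
d = np.sin(idx[:, None] + 2.0 * idx[None, :])
d = 0.5 * (d + d.T)
step = 1e-6
gap_plus, _ = gap_and_gradient(h_eff + step * d, n_occ)
gap_minus, _ = gap_and_gradient(h_eff - step * d, n_occ)
finite_difference = (gap_plus - gap_minus) / (2.0 * step)
analytic = float(np.sum(grad * d))
```

In a training loop, a gradient of this kind would be chained with the derivative of ${\mathbf V}^{\theta}$ with respect to the network parameters. The sketch covers eigenvalue derivatives only; properties that depend on the orbitals themselves need eigenvector derivatives, which is where near-degenerate levels can cause numerical trouble.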

Abstract

Machine learning (ML) plays an important role in quantum chemistry, providing fast-to-evaluate predictive models for various properties of molecules. However, most existing ML models for molecular electronic properties use density functional theory (DFT) databases as ground truth in training, so their prediction accuracy cannot surpass that of DFT. In this work, we developed a unified ML method for the electronic structure of organic molecules using gold-standard CCSD(T) calculations as training data. Tested on hydrocarbon molecules, our model outperforms DFT with widely used hybrid and double-hybrid functionals in both computational cost and prediction accuracy for a range of quantum chemical properties. As case studies, we apply the model to aromatic compounds and semiconducting polymers, for both ground-state and excited-state properties, demonstrating its accuracy and its ability to generalize to complex systems that are hard to calculate using CCSD(T)-level methods.

Paper Structure (15 sections, 21 equations, 5 figures, 3 tables)

Introduction
Results
Theory and Model Architecture
Model Performance and Applications
Conclusion and Outlook
Methods
Graph encoding of atomic configuration
Architecture of the convolutional layer
Evaluating molecular properties
Data Availability
Code Availability
Acknowledgements
Perturbation theory-based back-propagation
Dataset and training parameters
Infrared spectrum

Figures (5)

Figure 1: Schematic of the EGNN electronic structure workflow. (a) Computational graph of the EGNN method, which calculates multiple quantum chemical properties from input atomic configurations. The computational graph consists of an input layer (green blocks), a convolutional layer (blue block), and an output layer (orange blocks). (b) Model architecture of the EGNN, which consists of two graph convolution layers and outputs both node features $\mathbf x_{I, \rm out}$ and edge features $\mathbf f_{IJ, \rm out}$. (c) Training and testing dataset generation. About 10,500 atomic configurations of 85 different hydrocarbon molecules are sampled from molecular dynamics trajectories. Data points are plotted on a map of the number of electrons versus the number of atoms, and the dot size reflects the number of training data points with the same chemical formula.
Figure 2: Benchmark of the model performance on the testing dataset. (a) Testing root-mean-square errors (RMSE) of different quantities as a function of training dataset size. (b) Computational cost of different methods plotted against the number of electrons. The computational cost is measured as the calculation time (node hours) on a single Intel Xeon Platinum 8260 CPU node with 48 cores on the MIT SuperCloud [reuther2018interactive], with sufficient memory for all calculations. The observed scaling deviates from the theoretical asymptotic scaling, e.g., $N^7$ for CCSD(T), because the parallelization efficiency is higher for larger molecules. In principle, the $N^7$ scaling for CCSD(T) would appear in the large-$N$ limit. (c) Prediction RMSE of the energy ($E$ per atom, referenced to separate atoms), electric dipole moment ($\vec{p}$), electric quadrupole moment ($\bf{Q}$), Mulliken atomic charge ($C$), Mayer bond order ($B$), energy gap (1$^{\rm st}$ excitation energy, $E_{\rm g}$), and static electric polarizability ($\alpha$; a.u. means atomic unit). Our EGNN method is compared with the B3LYP hybrid functional, the DSD-PBEP86 double-hybrid functional [kozuch2013spin], the DM21 ML functional [kirkpatrick2021pushing], and the AIQM1 ML potential [zheng2021artificial, dral2024mlatom]. A representative atomic configuration of each chemical formula is plotted for illustration.
Figure 3: Validation of the EGNN's predictions for gas-phase aromatic hydrocarbon molecules against experimental results. (a) Standard enthalpy of formation. The EGNN predictions and experimental values from Ref. [slayden2001energetics] (right axis) are compared for 11 molecules (see SI Table II for details). The difference between the EGNN predictions and the experimental values is shown by the orange line, and the experimental uncertainty is shown by the red line (left axis). (b) Infrared spectrum of benzene. The experimental data are from the NIST Chemistry WebBook [NISTwebbook]. Vibration modes corresponding to the peaks are labeled following the convention in Ref. [wilson1934normal].
Figure 4: EGNN predictions for the electronic properties of semiconducting polymers. (a) Atomic structure and HOMO wavefunctions of t-PA, polyphenylene (PPP), and c-PA. The HOMO wavefunctions are visualized by isosurfaces at the level of $\pm 0.01$ Å$^{-3/2}$ (positive isosurface colored blue and negative isosurface colored yellow). (b) Energy gap and (c) static electric polarizability of t-PA (blue lines), PPP (green lines), and c-PA (orange lines) as a function of polymer chain length. Longitudinal polarizability $\alpha_{xx}$, horizontal polarizability $\alpha_{yy}$, and vertical polarizability $\alpha_{zz}$ are shown as solid, dashed, and dotted lines, respectively. Squares (blue for t-PA and green for PPP) represent literature values for polymers from experiments [grem1992realization, heeger2001nobel] and correlated calculations [otto2004dynamic], and blue dots represent literature values for t-PA oligomers from MP2 correlated calculations [champagne1998assessment].
Figure S1: Distribution of model prediction errors on the test dataset, compared with B3LYP DFT calculations, using the CCSD(T) results as the ground truth. (a-g) Cumulative distribution of prediction errors for the (a) energy, (b) electric dipole moment, (c) electric quadrupole moment, (d) Mulliken atomic charge, (e) Mayer bond order, (f) energy gap (1$^{\rm st}$ excitation energy), and (g) static electric polarizability (a.u. means atomic unit). The blue and orange solid lines represent EGNN and B3LYP results on the in-domain testing dataset, and the purple and red dashed lines represent EGNN and B3LYP results on the out-of-domain testing dataset, respectively. Hollow circles mark the model errors at the 50th, 80th, and 95th percentiles, from bottom to top.
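
The error statistics named in the captions (RMSE in Figure 2, cumulative error distributions and percentile markers in Figure S1) can be computed along the following lines. This is a hedged sketch with hypothetical helper names and synthetic numbers; it is not the authors' evaluation script, and the percentile convention (NumPy's default linear interpolation) is an assumption.

```python
import numpy as np


def rmse(pred, ref):
    """Root-mean-square error between predictions and reference values."""
    diff = np.asarray(pred, dtype=float) - np.asarray(ref, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def error_percentiles(pred, ref, levels=(50, 80, 95)):
    """Absolute error at the requested percentile levels."""
    abs_err = np.abs(np.asarray(pred, dtype=float) - np.asarray(ref, dtype=float))
    return {level: float(np.percentile(abs_err, level)) for level in levels}


def cumulative_distribution(pred, ref):
    """Sorted absolute errors and the fraction of samples at or below each."""
    abs_err = np.sort(np.abs(np.asarray(pred, dtype=float) - np.asarray(ref, dtype=float)))
    fraction = np.arange(1, abs_err.size + 1) / abs_err.size
    return abs_err, fraction


# Synthetic example values, not data from the paper.
reference = np.zeros(5)
predicted = np.array([0.1, -0.2, 0.3, -0.4, 0.5])

total_rmse = rmse(predicted, reference)
marks = error_percentiles(predicted, reference)
sorted_err, fraction = cumulative_distribution(predicted, reference)
```

Plotting `fraction` against `sorted_err` gives a curve of the kind shown in Figure S1, and the values in `marks` correspond to the hollow-circle markers.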
