Learning from the electronic structure of molecules across the periodic table

Manasa Kaniselvan, Benjamin Kurt Miller, Meng Gao, Juno Nam, Daniel S. Levine

TL;DR

This work tackles the data bottleneck in machine-learned atomic potentials by leveraging the often-unused Hamiltonian data from electronic-structure calculations. It introduces HELM, a scalable, equivariant GNN that predicts both the Hamiltonian and energies, and OMol_CSH_58k, a diverse, large-scale Hamiltonian dataset. Through Hamiltonian pretraining, the authors demonstrate substantial improvements in energy prediction in low-data regimes, supported by embedding analyses that reveal richer atomic representations. Collectively, the approach provides a practical pathway to incorporate electronic-structure information into MLIPs, enabling more transferable and data-efficient models across the periodic table.

Abstract

Machine-Learned Interatomic Potentials (MLIPs) require vast amounts of atomic structure data to learn forces and energies, and their performance continues to improve with training set size. Meanwhile, the even greater quantities of accompanying data in the Hamiltonian matrix H behind these datasets have so far gone unused for this purpose. Here, we provide a recipe for integrating the orbital interaction data within H into training pipelines for atomic-level properties. We first introduce HELM ("Hamiltonian-trained Electronic-structure Learning for Molecules"), a state-of-the-art Hamiltonian prediction model which bridges the gap between Hamiltonian prediction and universal MLIPs by scaling to H of structures with 100+ atoms, high elemental diversity, and large basis sets including diffuse functions. To accompany HELM, we release a curated Hamiltonian matrix dataset, 'OMol_CSH_58k', with unprecedented elemental diversity (58 elements), molecular size (up to 150 atoms), and basis set (def2-TZVPD). Finally, we introduce 'Hamiltonian pretraining' as a method to extract meaningful descriptors of atomic environments even from a limited number of atomic structures, and repurpose this shared embedding space to improve performance on energy prediction in low-data regimes. Our results highlight the use of electronic interactions as a rich and transferable data source for representing chemical space.

Paper Structure

This paper contains 24 sections, 16 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Architecture of HELM. Each rectangle in the backbone represents an embedding for an atom or inter-atomic interaction, has a shape of $(l_\mathrm{max}+1)^2 \times C$, and contains a set of learnable spherical harmonic coefficients for each $(l, m)$ up to a specified $l_\mathrm{max}$. TC = Tensor construction layer to reconstruct the Hamiltonian (see \ref{sec:processing_H}).
  • Figure 2: Breakdown of atomic and electronic interactions within the OMol_CSH_58k dataset. (a) Four sample molecules of increasing size and (b) magnitude plots of their Hamiltonian matrices. (c) The full dataset spans 58 atomic elements, as shown in the element distribution histogram; the inset details the corresponding orbital count within the def2-TZVPD basis across the dataset. (d) Element–element interaction heatmap indicating the diversity of interactions across the dataset. (e) Histogram of the distribution of atoms per structure (ranging from 10 to 150).
  • Figure 3: (a) The three training schemes we use to investigate the effect of Hamiltonian pretraining on the backbone of HELM for downstream energy prediction. (b) and (c) contain the corresponding loss over epochs for each of these schemes, when applied to $\nabla^2$DFT's train-2k split and OMol_CSH_58k. For the latter, the fill-between indicates a distribution of loss over each batch.
  • Figure 4: UMAP visualization of the output node embeddings ($z_i^{(K)}$) of the first 250 molecules of OMol_CSH_58k, extracted from the direct, pretrained-frozen, or finetuned models, and colored by element. There are 100,233 datapoints in total, from 14,319 atoms $\times$ 7 $l$-components ($l_\mathrm{max}=6$).
  • Figure 5: Dimensions of the (padded) target created for an example water molecule in a DZVP basis (taken from the MD17 dataset), and its decomposition into a direct sum of irreducible representations. The colored circles indicate the $L$s required to represent the interactions in each submatrix.
  • ...and 10 more figures
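
The shapes and counts quoted in the captions of Figures 1, 4, and 5 can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the HELM code: the helper names (`num_sh_coeffs`, `split_by_l`, `coupled_irreps`), the channel count `C = 4`, and the assumption that coefficients are stored in contiguous blocks ordered by increasing $l$ are all hypothetical. The coupling rule used for Figure 5 is the standard angular-momentum selection rule, $L \in \{|l_1 - l_2|, \dots, l_1 + l_2\}$; whether HELM orders or pads its blocks this way is not stated in this section.

```python
def num_sh_coeffs(l_max: int) -> int:
    """Count the (l, m) pairs with 0 <= l <= l_max and -l <= m <= l."""
    return sum(2 * l + 1 for l in range(l_max + 1))  # equals (l_max + 1) ** 2


def split_by_l(embedding: list[list[float]], l_max: int) -> list[list[list[float]]]:
    """Split an ((l_max + 1)^2 x C) embedding into l_max + 1 blocks of 2l + 1 rows.

    Assumes (hypothetically) that rows are ordered by increasing l.
    """
    if len(embedding) != num_sh_coeffs(l_max):
        raise ValueError("embedding has the wrong number of rows for this l_max")
    blocks, start = [], 0
    for l in range(l_max + 1):
        blocks.append(embedding[start:start + 2 * l + 1])
        start += 2 * l + 1
    return blocks


def coupled_irreps(l1: int, l2: int) -> list[int]:
    """Degrees L appearing in the product of irreps l1 and l2 (selection rule)."""
    return list(range(abs(l1 - l2), l1 + l2 + 1))


l_max, C = 6, 4  # l_max = 6 is from the Figure 4 caption; C = 4 is a placeholder
z = [[0.0] * C for _ in range(num_sh_coeffs(l_max))]  # one dummy node embedding
blocks = split_by_l(z, l_max)

rows_per_l = [len(b) for b in blocks]   # 2l + 1 rows for each l
n_umap_points = 14_319 * len(blocks)    # atoms x l-components, as in Figure 4
sp_block_irreps = coupled_irreps(0, 1)  # an s-p orbital block
pd_block_irreps = coupled_irreps(1, 2)  # a p-d orbital block
```

With $l_\mathrm{max} = 6$ each embedding has 49 rows that split into 7 blocks, one per $l$, which is consistent with the 7 $l$-components per atom and the 100,233 datapoints reported for Figure 4.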