Surprisingly High Redundancy in Electronic Structure Data

Sazzad Hossain, Ponkrshnan Thiagarajan, Shashank Pathrudkar, Stephanie Taylor, Abhijeet S. Gangan, Amartya S. Banerjee, Susanta Ghosh

TL;DR

This work reveals surprisingly high redundancy in the electronic structure data used for ML-based electron density prediction across molecules, simple metals, and complex alloys. By benchmarking pruning strategies, it shows that a coverage-driven CCS coreset can retain chemical accuracy and model generalizability even when 90–99% of the data are removed, while substantially reducing training time. GraNd pruning struggles at high pruning factors due to coverage loss, and random pruning is a workable but weaker baseline; CCS consistently delivers the best balance between data reduction and predictive fidelity. The findings suggest that a minimal, essential dataset exists for each material class, which would enable faster development of foundation-model-style surrogates for electronic structure with broad applicability and lower computational overhead.

Abstract

Accurate prediction of electronic structure underpins advances in chemistry, materials science, and condensed matter physics. In recent years, Machine Learning (ML) has enabled powerful surrogate models that predict the ground state electron density and related properties at a fraction of the computational cost of conventional first principles simulations. Such ML models typically rely on massive datasets generated through expensive Kohn-Sham Density Functional Theory (KS-DFT) calculations. A key reason for relying on such large datasets is the lack of prior knowledge about which portions of the data are essential and which are redundant. This study reveals significant redundancies in electronic structure datasets across various material systems, including molecules, simple metals, and chemically complex alloys, challenging the notion that extensive datasets are essential for accurate ML-based electronic structure predictions. We demonstrate that even random pruning can substantially reduce dataset size with minimal loss in predictive accuracy. Furthermore, a state-of-the-art coverage-based pruning strategy that selects data across all learning difficulties retains chemical accuracy and model generalizability using up to 100-fold less data, while reducing training time by threefold or more. By contrast, widely used importance-based pruning methods, which eliminate easy-to-learn data, can fail catastrophically at higher pruning factors due to a significant reduction in data coverage. This previously unexplored high redundancy in electronic structure data holds the potential to identify a minimal, essential dataset representative of each material class.
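
The three pruning families contrasted in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the per-sample difficulty `scores` array, the number of strata, and the budget-splitting rule are all assumptions made for illustration, and the coverage-based routine is a simplified stratified-sampling scheme written in the spirit of coverage-based selection rather than a faithful reproduction of CCS.

```python
import numpy as np


def random_prune(n_samples, keep, rng):
    """Keep `keep` samples chosen uniformly at random."""
    return rng.choice(n_samples, size=keep, replace=False)


def importance_prune(scores, keep):
    """Keep only the hardest samples (highest difficulty score).

    Easy-to-learn samples are discarded first, so at high pruning
    factors the low-score region of the data is no longer covered.
    """
    return np.argsort(scores)[-keep:]


def coverage_prune(scores, keep, n_strata=10, rng=None):
    """Stratified selection across the full range of difficulty scores.

    The score range is split into equal-width strata, and the budget is
    shared among them, starting with the smallest stratum so that any
    unused budget rolls over to the larger ones.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    edges = np.linspace(scores.min(), scores.max(), n_strata + 1)
    # Interior edges map every sample to a stratum id in 0..n_strata-1.
    stratum_id = np.digitize(scores, edges[1:-1])
    strata = [np.flatnonzero(stratum_id == b) for b in range(n_strata)]
    strata = sorted((s for s in strata if s.size > 0), key=len)

    selected, budget = [], keep
    for i, members in enumerate(strata):
        share = budget // (len(strata) - i)  # equal split of what is left
        take = min(members.size, share)
        selected.append(rng.choice(members, size=take, replace=False))
        budget -= take
    return np.concatenate(selected)


# Hypothetical difficulty scores for 1000 training samples, pruned by 99%.
demo_rng = np.random.default_rng(42)
demo_scores = demo_rng.lognormal(size=1000)
for name, idx in [
    ("random", random_prune(demo_scores.size, 10, demo_rng)),
    ("importance", importance_prune(demo_scores, 10)),
    ("coverage", coverage_prune(demo_scores, 10, rng=demo_rng)),
]:
    picked = demo_scores[idx]
    print(f"{name:>10}: score range kept = [{picked.min():.2f}, {picked.max():.2f}]")
```

Comparing the printed score ranges gives a feel for the coverage argument: the importance-based subset is drawn only from the high-score tail, whereas the stratified subset draws from every non-empty stratum.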

Paper Structure

This paper contains 21 sections, 5 equations, 27 figures, 1 table.

Figures (27)

  • Figure 1: Error in the ML-predicted electron density for models trained on the original dataset and on datasets pruned by 90% and 99% using random and CCS-based pruning. Each ML model was trained three times and the mean error is reported.
  • Figure 2: Error in energy with respect to KS-DFT, as obtained from the ML-predicted electron density for the original dataset and the 90% and 99% CCS-pruned datasets. The electron density prediction from one of the ML models was postprocessed in each case.
  • Figure 3: (a) $H^1$ seminorm and (b) $H^1$ norm of the error in the ML-predicted electron density for the original dataset and for the $90\%$ and $99\%$ randomly pruned and CCS-pruned datasets.
  • Figure 4: Two-dimensional (2D) slices showing the electron density obtained by KS-DFT and by ML models trained on the original dataset and on the 90% and 99% CCS-pruned datasets for (a) Aluminum at 1500 K, (b) Water at 600 K, (c) SiGeSn at 2400 K, and (d) CrFeCoNi at 5000 K. The electron density ($\rho$) along the solid line is also compared: the ML-predicted densities agree closely with KS-DFT for all three ML models across the four systems. The unit of electron density is $\text{Bohr}^{-3}$ (atomic units).
  • Figure 5: Error in the electron density prediction at various pruning factors for (a) Aluminum, (b) Water, (c) SiGeSn, and (d) CrFeCoNi. The shaded region represents the range (minimum to maximum) over three ML models around the mean, which is shown by a solid line.
  • ...and 22 more figures
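
For reference, the $H^1$ seminorm and $H^1$ norm named in the Figure 3 caption have the following standard definitions for an error field $e = \rho^{\text{ML}} - \rho^{\text{DFT}}$ on a domain $\Omega$. The symbols $e$, $\rho^{\text{ML}}$, $\rho^{\text{DFT}}$, and $\Omega$ are notation assumed here, and the paper may apply a normalization that is not shown:

$$
|e|_{H^1(\Omega)} = \left( \int_\Omega |\nabla e(\mathbf{r})|^2 \, d\mathbf{r} \right)^{1/2},
\qquad
\|e\|_{H^1(\Omega)} = \left( \int_\Omega |e(\mathbf{r})|^2 \, d\mathbf{r} + \int_\Omega |\nabla e(\mathbf{r})|^2 \, d\mathbf{r} \right)^{1/2}.
$$

The seminorm measures only the error in the density gradient, while the full norm also includes the $L^2$ error in the density itself.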