Surprisingly High Redundancy in Electronic Structure Data
Sazzad Hossain, Ponkrshnan Thiagarajan, Shashank Pathrudkar, Stephanie Taylor, Abhijeet S. Gangan, Amartya S. Banerjee, Susanta Ghosh
TL;DR
This work reveals surprisingly high redundancy in the electronic structure data used for ML-based electron density prediction across molecules, simple metals, and complex alloys. By benchmarking pruning strategies, it shows that a density-coverage–driven CCS coreset can retain chemical accuracy and model generalizability even when 90–99% of the data are removed, while cutting training time threefold or more. GraNd pruning struggles at high pruning factors because it loses data coverage, and random pruning offers only modest gains; CCS consistently delivers the best balance between data reduction and predictive fidelity. The findings imply that a minimal, essential dataset exists for each material class, which would enable faster development of foundation-model-style surrogates for electronic structure with broad applicability and lower computational overhead.
Abstract
Accurate prediction of electronic structure underpins advances in chemistry, materials science, and condensed matter physics. In recent years, Machine Learning (ML) has enabled the development of powerful surrogate models that predict the ground state electron density and related properties at a fraction of the computational cost of conventional first principles simulations. Such ML models typically rely on massive datasets generated through expensive Kohn-Sham Density Functional Theory calculations. A key reason for relying on such large datasets is the lack of prior knowledge about which portions of the data are essential and which are redundant. This study reveals significant redundancies in electronic structure datasets across various material systems, including molecules, simple metals, and chemically complex alloys, challenging the notion that extensive datasets are essential for accurate ML-based electronic structure predictions. We demonstrate that even random pruning can substantially reduce dataset size with minimal loss in predictive accuracy. Furthermore, a state-of-the-art coverage-based pruning strategy that selects data across all learning difficulties retains chemical accuracy and model generalizability using up to 100-fold less data, while reducing training time by threefold or greater. By contrast, widely used importance-based pruning methods, which eliminate easy-to-learn data, can fail catastrophically at higher pruning factors due to a significant reduction in data coverage. This heretofore unexplored high redundancy in electronic structure data holds the potential to identify a minimal, essential dataset representative of each material class.
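
The contrast between the three pruning families named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the per-sample difficulty `scores` (which might come from a gradient-norm measure such as GraNd), and the equal-width binning are all assumptions made for illustration, and the stratified routine is a simplified stand-in for a coverage-based method rather than the exact CCS procedure.

```python
import numpy as np

def random_prune(n_samples, keep, rng):
    """Keep a uniformly random subset of sample indices."""
    return rng.choice(n_samples, size=keep, replace=False)

def importance_prune(scores, keep):
    """Keep the `keep` highest-scoring (hardest) samples; easy ones are dropped."""
    return np.argsort(scores)[-keep:]

def coverage_prune(scores, keep, n_bins, rng):
    """Stratified selection (hypothetical, simplified): spread the budget over difficulty bins."""
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    # Interior edges give a bin index in [0, n_bins - 1] for every sample.
    bin_ids = np.digitize(scores, edges[1:-1])
    bins = [np.flatnonzero(bin_ids == b) for b in range(n_bins)]
    # Visit sparse bins first so their unused budget flows to denser bins.
    bins = sorted((b for b in bins if b.size > 0), key=lambda b: b.size)
    selected, budget = [], keep
    for i, members in enumerate(bins):
        share = min(members.size, budget // (len(bins) - i))
        selected.append(rng.choice(members, size=share, replace=False))
        budget -= share
    return np.concatenate(selected)

# Toy usage: evenly spread synthetic scores, not real electronic structure data.
rng = np.random.default_rng(0)
scores = np.linspace(0.0, 1.0, 1000)
keep = 100  # a 10-fold pruning factor
subsets = {
    "random": random_prune(scores.size, keep, rng),
    "importance": importance_prune(scores, keep),
    "coverage": coverage_prune(scores, keep, n_bins=10, rng=rng),
}
for name, idx in subsets.items():
    print(name, "score range:", scores[idx].min(), scores[idx].max())
```

In this sketch the importance-based subset contains only the highest-scoring samples, while the stratified subset draws from every difficulty bin. That difference is the loss of coverage the abstract points to as the cause of failure at high pruning factors.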
