Pre-training Graph Neural Networks with Structural Fingerprints for Materials Discovery
Shuyi Jia, Shitij Govil, Manav Ramprasad, Victor Fung
TL;DR
This work tackles the data bottleneck in applying graph neural networks (GNNs) to materials discovery by introducing descriptor-based pre-training, which uses cheap, physics-informed structural fingerprints as self-generated targets. By pre-training with wACSF, GMP, or EAD descriptors on a large MP Relaxed dataset, the GNN learns transferable representations without requiring expensive quantum-mechanical labels. Fine-tuning on 11 downstream MatBench tasks (plus three specialized sets) shows consistent MAE improvements, with EAD often yielding the strongest gains, while substantially reducing data-generation costs. The results suggest that scalable, descriptor-guided pre-training can serve as a foundation for billion-scale atomistic datasets and can complement existing force-field or denoising-based pre-training approaches, enabling more cost-effective large-scale materials discovery pipelines.
Abstract
In recent years, pre-trained graph neural networks (GNNs) have been developed as general models that can be effectively fine-tuned for a variety of downstream tasks in materials science, and they have shown significant improvements in accuracy and data efficiency. The most widely used pre-training methods currently involve either supervised training to fit a general force field or self-supervised training by denoising equilibrium atomic structures. Both methods require datasets generated from quantum mechanical calculations, which quickly become intractable when scaling to larger datasets. Here we propose a novel pre-training objective that instead uses cheaply computed structural fingerprints as targets, while maintaining comparable performance across a range of different structural descriptors. Our experiments show that this approach can act as a general strategy for pre-training GNNs, with applications towards large-scale foundational models for atomistic data.
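To make the idea of fingerprint targets concrete, the following is a minimal sketch under stated assumptions and is not the implementation used in this work. The function `radial_fingerprint` is a hypothetical, simplified radial symmetry function (Gaussian-weighted neighbor distances with a cosine cutoff); it is not wACSF, GMP, or EAD, and it ignores periodic boundary conditions and element weighting. The function `pretraining_loss` assumes a mean-absolute-error regression of per-atom model outputs onto those fingerprints; the parameter names `centers`, `eta`, and `cutoff` and their values are illustrative only.

```python
import math


def radial_fingerprint(positions, centers=(1.0, 2.0, 3.0), eta=4.0, cutoff=4.0):
    """Toy per-atom fingerprint (hypothetical, not wACSF/GMP/EAD).

    For each atom, sums Gaussians of the neighbor distance around each
    center, damped by a cosine cutoff. No periodic images are considered.
    """
    fingerprints = []
    for i, ri in enumerate(positions):
        feats = [0.0] * len(centers)
        for j, rj in enumerate(positions):
            if i == j:
                continue
            d = math.dist(ri, rj)
            if d >= cutoff:
                continue
            # Smoothly decays to zero at the cutoff radius.
            fc = 0.5 * (math.cos(math.pi * d / cutoff) + 1.0)
            for k, mu in enumerate(centers):
                feats[k] += math.exp(-eta * (d - mu) ** 2) * fc
        fingerprints.append(feats)
    return fingerprints


def pretraining_loss(predicted, target):
    """Assumed objective: mean absolute error over all atoms and features."""
    total = sum(
        abs(p - t)
        for pred_row, targ_row in zip(predicted, target)
        for p, t in zip(pred_row, targ_row)
    )
    count = sum(len(targ_row) for targ_row in target)
    return total / count


# Usage: targets come from atomic positions alone, with no quantum-mechanical labels.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
targets = radial_fingerprint(positions)
# Stand-in for per-atom GNN outputs before any training.
untrained_outputs = [[0.0] * 3 for _ in positions]
loss = pretraining_loss(untrained_outputs, targets)
```

In this sketch the targets depend only on interatomic distances, so they are invariant to translation and rotation of the structure and cost nothing beyond a neighbor search. A real pipeline would replace the toy fingerprint with one of the descriptors named above and the stand-in outputs with node-level predictions from the GNN.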
