Pre-training Graph Neural Networks with Structural Fingerprints for Materials Discovery

Shuyi Jia, Shitij Govil, Manav Ramprasad, Victor Fung

TL;DR

This work tackles the data bottleneck in applying graph neural networks (GNNs) to materials discovery by introducing descriptor-based pre-training, which uses cheap, physics-informed structural fingerprints as self-generated targets. By pre-training on wACSF, GMP, or EAD descriptors computed for the large MP Relaxed dataset, the GNN learns transferable representations without requiring expensive quantum-mechanical labels. Fine-tuning on 11 downstream MatBench tasks (plus three specialized sets) shows consistent MAE improvements, with EAD often yielding the strongest gains, while substantially reducing data-generation costs. The results suggest that scalable, descriptor-guided pre-training can form a foundation for billion-scale atomistic datasets and can complement force-field or denoising-based pre-training, paving the way for more cost-effective large-scale materials discovery pipelines.

Abstract

In recent years, pre-trained graph neural networks (GNNs) have been developed as general models that can be effectively fine-tuned for various downstream tasks in materials science, and they have shown significant improvements in accuracy and data efficiency. The most widely used pre-training methods currently involve either supervised training to fit a general force field or self-supervised training by denoising equilibrium atomic structures. Both methods require datasets generated from quantum mechanical calculations, which quickly become intractable when scaling to larger datasets. Here we propose a novel pre-training objective which instead uses cheaply computed structural fingerprints as targets while maintaining comparable performance across a range of different structural descriptors. Our experiments show this approach can act as a general strategy for pre-training GNNs, with application towards large-scale foundational models for atomistic data.
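The pre-training objective described in the abstract can be illustrated with a minimal sketch. It makes several assumptions not stated in this section: the fingerprint is a toy Gaussian radial descriptor (a simplified stand-in, not wACSF, GMP, or EAD), targets are per-atom vectors, the loss is a mean-squared error, and periodic boundary conditions are ignored. The names `radial_fingerprint` and `pretraining_loss` are hypothetical, and an all-zero array stands in for the GNN's node outputs.

```python
import numpy as np

def radial_fingerprint(positions, cutoff=5.0, etas=(0.5, 1.0, 2.0)):
    """Toy per-atom descriptor: sum_j exp(-eta * r_ij^2) * fc(r_ij).

    A simplified stand-in for a structural fingerprint; it needs only
    atomic positions, so no quantum-mechanical labels are involved.
    """
    positions = np.asarray(positions, dtype=float)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # pairwise distances, shape (N, N)
    # Smooth cosine cutoff that is zero at and beyond the cutoff radius.
    fc = np.where(dist < cutoff, 0.5 * (np.cos(np.pi * dist / cutoff) + 1.0), 0.0)
    np.fill_diagonal(fc, 0.0)  # exclude each atom's self-interaction
    etas = np.asarray(etas, dtype=float)
    gauss = np.exp(-etas[None, None, :] * dist[:, :, None] ** 2)  # (N, N, K)
    return (gauss * fc[:, :, None]).sum(axis=1)  # sum over neighbors -> (N, K)

def pretraining_loss(predicted, target):
    """Mean-squared error between predicted and target fingerprints (assumed loss)."""
    return float(np.mean((np.asarray(predicted) - np.asarray(target)) ** 2))

# Three atoms in a right-angle arrangement (coordinates in arbitrary length units).
positions = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]])
targets = radial_fingerprint(positions)   # self-generated pre-training targets
predicted = np.zeros_like(targets)        # placeholder for GNN node outputs
loss = pretraining_loss(predicted, targets)
```

In an actual pipeline the placeholder `predicted` would come from the GNN's node-level head, and the loss would be minimized over a large set of structures before fine-tuning on labeled data.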

Paper Structure

This paper contains 18 sections, 6 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Visualization of Graph-Level Embeddings: t-distributed stochastic neighbor embedding (t-SNE) plots of graph-level embeddings from the pooling layer for models pre-trained with wACSF, GMP, and EAD. Panels (a)-(c) correspond to the KVRH dataset; panels (d)-(f) correspond to the MOF dataset. Data points are color-coded based on bulk modulus (GPa) and band gap (eV), respectively.
  • Figure 2: MAEs on the fine-tuning datasets for models pre-trained at 12, 25, 50, 100, and 200 epochs. The results corresponding to 200 epochs are from the main results table.