Scaling Laws of Graph Neural Networks for Atomistic Materials Modeling

Chaojian Li, Zhifan Ye, Massimiliano Lupo Pasini, Jong Youl Choi, Cheng Wan, Yingyan Celine Lin, Prasanna Balaprakash

TL;DR

This work studies how graph neural networks (GNNs) for atomistic materials modeling scale, deriving scaling laws for models of up to billions of parameters and datasets of up to terabytes. It trains a foundational GNN with billions of parameters on terabyte-scale data, using DeepSpeed, activation checkpointing, and the ZeRO optimizer to manage memory across a distributed infrastructure. Key findings include diminishing returns from scaling the model alone beyond a few billion parameters, strong gains from scaling the data, and the effective transfer of techniques from large language and image models to GNNs. The results establish scaling laws, deliver a scalable foundation model and codebase, and provide a pathway toward rapid materials discovery and broader scientific applications.
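To make the memory-management ingredients named above concrete, the following is a minimal sketch of a DeepSpeed-style configuration that enables ZeRO and activation checkpointing. The ZeRO stage, the micro-batch size, and the helper name `make_deepspeed_config` are assumptions chosen for illustration; the summary does not state which settings the authors used.

```python
import json


def make_deepspeed_config(zero_stage=2, micro_batch_size=4):
    """Build an illustrative DeepSpeed config dict.

    The stage and batch size defaults are hypothetical, not the paper's values.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    return {
        # Per-GPU batch size for each forward/backward pass.
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": 1,
        # ZeRO partitions optimizer state (stage 1), gradients (stage 2),
        # and parameters (stage 3) across data-parallel ranks.
        "zero_optimization": {"stage": zero_stage},
        # Trade compute for memory by recomputing activations in backward.
        "activation_checkpointing": {
            "partition_activations": True,
            "cpu_checkpointing": False,
        },
    }


config = make_deepspeed_config()
# DeepSpeed accepts its configuration as JSON, so serialize the dict.
config_json = json.dumps(config, indent=2)
print(config_json)
```

A dict like this would typically be written to a JSON file and passed to the training launcher. Note that the `activation_checkpointing` section only takes effect when the model code calls DeepSpeed's checkpointing utilities, so the model itself would also need to wrap its layers accordingly.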

Abstract

Atomistic materials modeling is a critical task with wide-ranging applications, from drug discovery to materials science, where accurate predictions of the target material property can lead to significant advancements in scientific discovery. Graph Neural Networks (GNNs) represent the state-of-the-art approach for modeling atomistic material data thanks to their capacity to capture complex relational structures. While machine learning performance has historically improved with larger models and datasets, GNNs for atomistic materials modeling remain relatively small compared to large language models (LLMs), which leverage billions of parameters and terabyte-scale datasets to achieve remarkable performance in their domain. To address this gap, we explore the scaling limits of GNNs for atomistic materials modeling by developing a foundational model with billions of parameters, trained on extensive terabyte-scale datasets. Our approach incorporates techniques from LLM libraries to efficiently manage large-scale data and models, enabling both effective training and deployment of these large-scale GNN models. This work addresses three fundamental questions in scaling GNNs: how far GNN model architectures can be scaled, how dataset size affects model accuracy, and whether LLM-inspired techniques apply to GNN architectures. Specifically, the outcomes of this study include (1) insights into the scaling laws for GNNs, highlighting the relationship between model size, dataset volume, and accuracy, (2) a foundational GNN model optimized for atomistic materials modeling, and (3) a GNN codebase enhanced with advanced LLM-based training techniques. Our findings lay the groundwork for large-scale GNNs with billions of parameters and terabyte-scale datasets, establishing a scalable pathway for future advancements in atomistic materials modeling.
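As an illustration of what a scaling law relating model size to test loss looks like in practice, the sketch below fits a power law, loss ≈ a · N^(−b), by linear regression in log-log space. The functional form, the function name `fit_power_law`, and the data points are all assumptions for illustration: the points are synthetic, not results from the paper, and the abstract does not state which form the authors fit.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss ~= a * size**(-b) in log-log space.

    Returns (a, b). Inputs must be positive and of equal length.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares for y = intercept + slope * x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic (made-up) points generated from a known power law,
# used only to check that the fit recovers its parameters.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [2.0 * s ** -0.1 for s in sizes]

a, b = fit_power_law(sizes, losses)
print(f"a = {a:.4f}, b = {b:.4f}")
```

A pure power law keeps improving indefinitely, so it cannot by itself capture diminishing returns that flatten out. A common alternative is to add a constant irreducible-loss term, which requires a nonlinear fit rather than the log-log regression used here.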

Paper Structure

This paper contains 20 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison of large-scale GNNs on multiple commonly used biology/chemistry materials modeling datasets [hu2020open] with the foundational GNN developed in this work (indicated by a green star), after scaling both the model size and dataset size.
  • Figure 2: An overview of the developed multi-stack infrastructure for scalable GNN training, integrating data and model configuration, refactored codebase, and multi-GPU machine setup. This unified framework aims to simultaneously improve GNN task accuracy through data-driven model design, ensure code modularity and reuse via codebase restructuring, and optimize training efficiency and scalability with a multi-GPU hardware architecture.
  • Figure 3: The effect of scaling GNN model sizes across various dataset sizes on the final test loss.
  • Figure 4: The effect of scaling atomistic materials modeling dataset sizes across various GNN model sizes on the final test loss.
  • Figure 5: Comparison of how scaling GNN model depth (i.e., number of layers) and width (i.e., number of neurons in each layer) affects the test loss when training on a dataset of 0.4 TB in size.
  • ...and 1 more figure