GraphStorm: all-in-one graph machine learning framework for industry applications

Da Zheng, Xiang Song, Qi Zhu, Jian Zhang, Theodore Vasiloudis, Runjie Ma, Houyu Zhang, Zichen Wang, Soji Adeshina, Israt Nisa, Alejandro Mottini, Qingjun Cui, Huzefa Rangwala, Belinda Zeng, Christos Faloutsos, George Karypis

TL;DR

GraphStorm addresses the difficulty of deploying graph machine learning on industry-scale data by delivering an end-to-end, no-code/low-code framework built atop DistDGL. It enables graph construction from tabular data, scalable training and inference on billion-scale graphs, and a rich model zoo with techniques for heterogeneous, text-rich, and featureless-node scenarios. Key contributions include on-the-fly sampling, end-to-end pipelines, and specialized modeling methods such as joint text-graph training, GNN distillation, and optimized link prediction, all validated on MAG and Amazon Review-scale graphs. The framework is production-oriented, demonstrating substantial performance gains and practical deployment in multiple industry applications, with a design that also supports researchers venturing into large-scale graph modeling. GraphStorm’s practical impact lies in lowering the barrier to adopting GML in industry and in accelerating the prototyping, tuning, and deployment of scalable, high-performance graph models.

Abstract

Graph machine learning (GML) is effective in many business applications. However, making GML easy to use and applicable to industry applications with massive datasets remains challenging. We developed GraphStorm, which provides an end-to-end solution for scalable graph construction, graph model training, and inference. GraphStorm has the following desirable properties: (a) Easy to use: it can perform graph construction and model training and inference with just a single command; (b) Expert-friendly: GraphStorm contains many advanced GML modeling techniques to handle complex graph data and improve model performance; (c) Scalable: every component in GraphStorm can operate on graphs with billions of nodes and can scale model training and inference to different hardware without changing any code. GraphStorm has been used and deployed for over a dozen billion-scale industry applications since its release in May 2023. It is open-sourced on GitHub: https://github.com/awslabs/graphstorm.

Paper Structure

This paper contains 32 sections, 1 equation, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Easy and Scalable GML with GraphStorm.
  • Figure 2: The functionalities of GraphStorm. The colored lines show examples of constructing a complete model solution in GraphStorm.
  • Figure 3: GraphStorm architecture.
  • Figure 4: GraphStorm training script for node classification.
  • Figure 5: Jointly modeling text and graph data on Microsoft Academic Graphs.
  • ...and 3 more figures