Graph Data Management and Graph Machine Learning: Synergies and Opportunities

Arijit Khan, Xiangyu Ke, Yinghui Wu

TL;DR

This survey addresses the integration of graph data management (GDM) and graph machine learning (GML) to build scalable, explainable graph-centric data pipelines. It synthesizes techniques across data cleaning and augmentation, scalable graph embedding and GNN training, graph-based vector indexes, explainability, and knowledge-graph querying with graph-RAG-LLM workflows. By identifying three core synergy modes (GDM<->GML enhancement, GDM for downstream ML, and ML-driven data management improvements), the paper maps concrete methods and system designs that enable end-to-end graph intelligence at scale. The work provides a framework for researchers and practitioners to design robust, efficient, and trustworthy graph data ecosystems, and outlines future directions such as real-time learning, privacy-preserving ML, and unified LLM/KG/Vector DB architectures.

Abstract

The ubiquity of machine learning, particularly deep learning, applied to graphs is evident in applications ranging from cheminformatics (drug discovery) and bioinformatics (protein interaction prediction) to knowledge graph-based query answering, fraud detection, and social network analysis. Concurrently, graph data management deals with the research and development of effective, efficient, scalable, robust, and user-friendly systems and algorithms for storing, processing, and analyzing vast quantities of heterogeneous and complex graph data. Our survey provides a comprehensive overview of the synergies between graph data management and graph machine learning, illustrating how they intertwine and mutually reinforce each other across the entire spectrum of the graph data science and machine learning pipeline. Specifically, the survey highlights two crucial aspects: (1) how graph data management enhances graph machine learning, including contributions such as improved graph neural network performance through graph data cleaning, scalable graph embedding, efficient graph-based vector data management, robust graph neural networks, and user-friendly explainability methods; and (2) how graph machine learning, in turn, aids graph data management, with a focus on applications such as query answering over knowledge graphs and various data science tasks. We discuss pertinent open problems and delineate crucial research directions.

Paper Structure

This paper contains 13 sections, 1 equation, 1 figure.

Figures (1)

  • Figure 1: Graph data pipeline in data science and machine learning applications. Graph embedding can be task-specific or task-agnostic. Graph neural network (GNN) training can be end-to-end based on downstream tasks. We show which phases belong to GDM and which belong to GML, and how the two can benefit from each other.