Graph Data Management and Graph Machine Learning: Synergies and Opportunities
Arijit Khan, Xiangyu Ke, Yinghui Wu
TL;DR
This survey examines the integration of graph data management (GDM) and graph machine learning (GML) to build scalable, explainable graph-centric data pipelines. It synthesizes techniques across data cleaning and augmentation, scalable graph embedding and GNN training, graph-based vector indexes, explainability, and knowledge-graph querying with graph-RAG-LLM workflows. By identifying three core synergy modes (GDM<->GML enhancement, GDM for downstream ML, and ML-driven data management improvements), the paper maps concrete methods and system designs that enable end-to-end graph intelligence at scale. The work provides a framework for researchers and practitioners to design robust, efficient, and trustworthy graph data ecosystems, and outlines future directions such as real-time learning, privacy-preserving ML, and unified LLM/KG/Vector DB architectures.
Abstract
The ubiquity of machine learning, particularly deep learning, applied to graphs is evident in applications ranging from cheminformatics (drug discovery) and bioinformatics (protein interaction prediction) to knowledge graph-based query answering, fraud detection, and social network analysis. Concurrently, graph data management deals with the research and development of effective, efficient, scalable, robust, and user-friendly systems and algorithms for storing, processing, and analyzing vast quantities of heterogeneous and complex graph data. Our survey provides a comprehensive overview of the synergies between graph data management and graph machine learning, illustrating how they intertwine and mutually reinforce each other across the entire spectrum of the graph data science and machine learning pipeline. Specifically, the survey highlights two crucial aspects: (1) how graph data management enhances graph machine learning, including contributions such as improved graph neural network performance through graph data cleaning, scalable graph embedding, efficient graph-based vector data management, robust graph neural networks, and user-friendly explainability methods; and (2) how graph machine learning, in turn, aids graph data management, with a focus on applications like query answering over knowledge graphs and various data science tasks. We discuss pertinent open problems and delineate crucial research directions.
