RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases

Dongwon Choi, Sunwoo Kim, Juyeon Kim, Kyungho Kim, Geon Lee, Shinhwan Kang, Myunghwan Kim, Kijung Shin

TL;DR

RDB2G-Bench is the first benchmark framework for evaluating automatic graph-modeling strategies that convert relational databases (RDBs) into graphs for downstream predictive tasks. By precomputing performance for around 50,000 graph models across 5 real-world RDBs and 12 tasks, it enables reproducible, rapid evaluation of 10 modeling methods, including heuristic, search-based, and LLM-based approaches, with reported speedups of up to 389x over on-the-fly evaluation. The study finds that which tables are included and how they are modeled (e.g., Row2Edge vs Row2Node) significantly affect performance, and that no single modeling rule works across all tasks. It also shows that effective graph models generalize across GNN architectures, identifies substructures shared by top models, and demonstrates the potential of LLM-based approaches despite their current limitations. The publicly available datasets and code aim to accelerate progress in RDB-to-graph modeling by enabling efficient, fair comparisons and broader applicability across predictive GNNs.

Abstract

Recent advances have demonstrated the effectiveness of graph-based learning on relational databases (RDBs) for predictive tasks. Such approaches require transforming RDBs into graphs, a process we refer to as RDB-to-graph modeling, where rows of tables are represented as nodes and foreign-key relationships as edges. Yet, effective modeling of RDBs into graphs remains challenging. Specifically, there exist numerous ways to model RDBs into graphs, and performance on predictive tasks varies significantly depending on the chosen graph model of RDBs. In our analysis, we find that the best-performing graph model, which remains non-trivial to identify, can yield up to 10% higher performance than the common heuristic rule for graph modeling. To foster research on intelligent RDB-to-graph modeling, we introduce RDB2G-Bench, the first benchmark framework for evaluating such methods. We construct extensive datasets covering 5 real-world RDBs and 12 predictive tasks, resulting in around 50k graph model-performance pairs for efficient and reproducible evaluations. Thanks to our precomputed datasets, we were able to benchmark 10 automatic RDB-to-graph modeling methods on the 12 tasks about 380x faster than on-the-fly evaluation, which requires repeated GNN training. Our analysis of the datasets and benchmark results reveals key structural patterns affecting graph model effectiveness, along with practical implications for effective graph modeling. Our datasets and code are available at https://github.com/chlehdwon/RDB2G-Bench.
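
To make the two modeling choices named in the TL;DR concrete, the following is a minimal sketch, not the benchmark's own implementation. It builds a toy RDB whose table names (users, events, event_attendees) are borrowed from the figure captions, while the rows, the `row_id` column, and the functions `row2node` and `row2edge` are hypothetical and exist only for illustration. Under Row2Node, every row of every table becomes a node and each foreign-key value becomes an edge; under Row2Edge, each row of a table with two foreign keys becomes a single edge between the two rows it references.

```python
# Toy RDB (hypothetical data): two entity tables and one table with two FKs.
users = [{"user_id": 1}, {"user_id": 2}]
events = [{"event_id": 10}, {"event_id": 11}]
event_attendees = [
    {"row_id": 0, "user_id": 1, "event_id": 10},
    {"row_id": 1, "user_id": 2, "event_id": 10},
    {"row_id": 2, "user_id": 2, "event_id": 11},
]


def entity_nodes(users, events):
    """Nodes for the entity tables, keyed by (table name, primary key)."""
    nodes = [("users", u["user_id"]) for u in users]
    nodes += [("events", e["event_id"]) for e in events]
    return nodes


def row2node(users, events, attendees):
    """Row2Node: every row is a node; every FK value is an edge."""
    nodes = entity_nodes(users, events)
    nodes += [("event_attendees", a["row_id"]) for a in attendees]
    edges = []
    for a in attendees:
        src = ("event_attendees", a["row_id"])
        edges.append((src, ("users", a["user_id"])))    # FK: user_id
        edges.append((src, ("events", a["event_id"])))  # FK: event_id
    return nodes, edges


def row2edge(users, events, attendees):
    """Row2Edge: each event_attendees row is one user-event edge."""
    nodes = entity_nodes(users, events)
    edges = [
        (("users", a["user_id"]), ("events", a["event_id"]))
        for a in attendees
    ]
    return nodes, edges


n2n_nodes, n2n_edges = row2node(users, events, event_attendees)
r2e_nodes, r2e_edges = row2edge(users, events, event_attendees)
print(len(n2n_nodes), len(n2n_edges))  # 7 6
print(len(r2e_nodes), len(r2e_edges))  # 4 3
```

On this toy input, Row2Edge yields a smaller graph (4 nodes, 3 edges) than Row2Node (7 nodes, 6 edges) and connects users to events in one hop instead of two. Whether that helps a given predictive task is an empirical question, which is what the precomputed graph model-performance pairs are meant to answer.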

Paper Structure

This paper contains 44 sections, 8 equations, 31 figures, 9 tables.

Figures (31)

  • Figure 1: Overview of key concepts. An RDB schema is converted into various network schemas using different RDB-to-graph (RDB2Graph) modeling methods. Graphs are then constructed from these schemas, where graph neural networks (GNNs) are trained and evaluated. In the given example task, optimal modeling yields up to a 5% performance improvement over a widely used heuristic [fey2023relational]. Note that the optimal graph model selectively uses tables and foreign key (FK) relations, with table rows modeled as edges, while the heuristic models the entire RDB with all table rows as nodes.
  • Figure 2: (a) We summarize the RDBs, tasks, and their associated graph models. For each classification, regression, and recommendation task, we collect AUC-ROC (%), MAE, and MAP (%), respectively, on each graph model. For each task, we report the performances on the best graph model, the worst model, and that given by AR2N modeling [robinson2024relbench]. (b) For three tasks (driver-top3, user-attendance, post-post-related), we visualize the distribution of performances on the downstream task (Y-axis) across all graph models, along with training time per epoch (X-axis) and the parameter size of the graph neural network (indicated by color). Note that there exist graph models yielding substantial improvements in both performance and efficiency compared to those generated by widely used AR2N modeling [robinson2024relbench].
  • Figure 3: Modeling table rows as edges (Row2Edge) can be crucial, depending on the task (Obs 2). EA (event_attendees), EI (event_interest), and UF (user_friends) indicate the tables whose rows can be modeled as edges. Note that Row2Edge modeling improves performance for the user-repeat task, but not for the user-ignore task, even when both are defined on the same RDB.
  • Figure 4: Top-performing graph models share common substructures (Obs 3). As shown in their graph models, the top-5 graph models commonly (a) include the foreign-key (FK) relationship events$\rightarrow$users and (b) model either the event_attendees table or the event_interest table as edges. Note that the user_friends table has two FKs (user and friend), both referencing the users table.
  • Figure 5: Different tasks may require different graph models, even on the same RDB (Obs 4). Spearman correlations between downstream task performances on each RDB (rel-f1, rel-event, or rel-avito) are generally low (below 0.4), except for tasks with closely aligned goals.
  • ...and 26 more figures

Theorems & Definitions (1)

  • Definition 1: RDB-to-Graph Modeling