4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs

Minjie Wang, Quan Gan, David Wipf, Zhenkun Cai, Ning Li, Jianheng Tang, Yanlin Zhang, Zizhao Zhang, Zunyao Mao, Yakun Song, Yanbo Wang, Jiahang Li, Han Zhang, Guang Yang, Xiao Qin, Chuan Lei, Muhan Zhang, Weinan Zhang, Christos Faloutsos, Zheng Zhang

TL;DR

The paper introduces 4DBInfer, a scalable, open-source toolbox to benchmark predictive modeling on relational databases across four dimensions: datasets, tasks, graph extraction strategies, and predictive baselines. It formalizes a unified framework that converts multi-table RDBs into graphs, distills subgraphs via sampling, and trains models (both tabular and graph-based) on the resulting subgraphs, accommodating inductive and transductive settings. A new suite of large-scale, diverse RDB benchmarks is proposed to avoid information loss from over-curated graphs and to reflect real-world relational structure, with the tool enabling head-to-head comparisons across the 4D design space. Empirical results show that both DFS-based late fusion and GNN-based early fusion methods can outperform naïve single-table or simple-join baselines, underscoring the importance of considering multiple dimensions when modeling RDB data and suggesting that optimal solutions may lie at the tabular-graph boundary.

Abstract

Although RDBs store vast amounts of rich, informative data spread across interconnected tables, the progress of predictive machine learning models as applied to such tasks arguably falls well behind advances in other domains such as computer vision or natural language processing. This deficit stems, at least in part, from the lack of established/public RDB benchmarks as needed for training and evaluation purposes. As a result, related model development thus far often defaults to tabular approaches trained on ubiquitous single-table benchmarks, or on the relational side, graph-based alternatives such as GNNs applied to a completely different set of graph datasets devoid of tabular characteristics. To more precisely target RDBs lying at the nexus of these two complementary regimes, we explore a broad class of baseline models predicated on: (i) converting multi-table datasets into graphs using various strategies equipped with efficient subsampling, while preserving tabular characteristics; and (ii) trainable models with well-matched inductive biases that output predictions based on these input subgraphs. Then, to address the dearth of suitable public benchmarks and reduce siloed comparisons, we assemble a diverse collection of (i) large-scale RDB datasets and (ii) coincident predictive tasks. From a delivery standpoint, we operationalize the above four dimensions (4D) of exploration within a unified, scalable open-source toolbox called 4DBInfer. We conclude by presenting evaluations using 4DBInfer, the results of which highlight the importance of considering each such dimension in the design of RDB predictive models, as well as the limitations of more naive approaches such as simply joining adjacent tables. Our source code is released at https://github.com/awslabs/multi-table-benchmark.
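
To make step (i) of the abstract concrete, the following is a minimal sketch of one possible graph extraction strategy: every row becomes a node, every foreign-key reference becomes an edge, and a fanout-limited neighbor sampler distills a subgraph around a seed row. The toy tables, the foreign_keys mapping, and the function names rdb_to_graph and sample_subgraph are all hypothetical illustrations and are not taken from the 4DBInfer codebase. Row features stay in the original tables, keyed by node, so the tabular characteristics remain available to a downstream model.

```python
import random
from collections import defaultdict

# Hypothetical toy RDB: each table maps a primary key to a row of column values.
tables = {
    "user": {1: {"age": 31}, 2: {"age": 45}},
    "item": {10: {"price": 9.5}, 11: {"price": 20.0}},
    "purchase": {
        100: {"user_id": 1, "item_id": 10, "qty": 2},
        101: {"user_id": 1, "item_id": 11, "qty": 1},
        102: {"user_id": 2, "item_id": 10, "qty": 5},
    },
}
# Foreign keys (assumed schema): (table, column) -> referenced table.
foreign_keys = {("purchase", "user_id"): "user", ("purchase", "item_id"): "item"}


def rdb_to_graph(tables, foreign_keys):
    """Treat every row as a node and every foreign-key reference as an undirected edge."""
    adj = defaultdict(set)
    for (table, column), target in foreign_keys.items():
        for pk, row in tables[table].items():
            src, dst = (table, pk), (target, row[column])
            adj[src].add(dst)
            adj[dst].add(src)
    return adj


def sample_subgraph(adj, seed, hops, fanout, rng):
    """Collect nodes within `hops` of `seed`, adding at most `fanout` new neighbors per node."""
    visited, frontier = {seed}, [seed]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            # Sort for a reproducible order before sampling unvisited neighbors.
            neighbors = sorted(adj[node] - visited)
            for nb in rng.sample(neighbors, min(fanout, len(neighbors))):
                visited.add(nb)
                next_frontier.append(nb)
        frontier = next_frontier
    return visited


adj = rdb_to_graph(tables, foreign_keys)
subgraph = sample_subgraph(adj, ("user", 1), hops=2, fanout=10, rng=random.Random(0))
```

With a fanout large enough to keep every neighbor, the two-hop subgraph around user 1 contains that user, both of their purchases, and the two purchased items. Lowering the fanout bounds the subgraph size, which is the property that makes sampling useful on large databases.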

Paper Structure

This paper contains 63 sections, 1 theorem, 21 equations, 11 figures, and 10 tables.

Key Result

Proposition 1

Let ${\mathcal{G}}$ denote a heterogeneous graph and ${\mathcal{A}}$ a mapping that converts ${\mathcal{G}}$ to a degenerate single-table RDB as described above. Furthermore, let $\hbox{Norm}$ denote an operator that normalizes an RDB according to the first through fourth database normal forms. We rem

Figures (11)

  • Figure 1: 4DBInfer exploration dimensions. Unlike prior benchmarking efforts (table columns on right), 4DBInfer considers an evaluation space with diversity across the 4D Cartesian product of (i) datasets, (ii) tasks, (iii) graph extractors, and (iv) predictive baselines. See the sections on the baseline design space and the RDB benchmarks (in particular the discussion of RDB choices) for further details of table properties and assumptions.
  • Figure 2: 4DBInfer overview. Left: First, an (i) RDB dataset and (ii) task (i.e., the predictive target here) are selected from among the proposed benchmarks. Middle: Then a (iii) graph extractor/sampling operator is chosen, which converts the RDB and task into subgraph chunks. Right: Lastly, a (iv) predictive model ingests these chunks, through either early or late feature fusion, to produce an estimate of the target values.
  • Figure 3: Schema graph for the AVS dataset.
  • Figure 4: Schema graph for the Outbrain dataset.
  • Figure 5: Schema graph for the Diginetica dataset.
  • ...and 6 more figures
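
The late-fusion path mentioned in the Figure 2 caption can be sketched as follows, under the assumption that late fusion means flattening each target row's related rows into fixed aggregate columns before any tabular model sees them (the TL;DR associates this with DFS-based methods). The tables, column names, and the function late_fusion_features are hypothetical and chosen only for illustration. An early-fusion model such as a GNN would instead consume the related rows directly and learn the aggregation end to end.

```python
from statistics import mean

# Hypothetical target table (one row per user) and a child table that references it.
users = {1: {"age": 31}, 2: {"age": 45}, 3: {"age": 27}}
purchases = [
    {"user_id": 1, "amount": 19.0},
    {"user_id": 1, "amount": 20.0},
    {"user_id": 2, "amount": 47.5},
]


def late_fusion_features(user_id):
    """Flatten a user's related purchase rows into fixed aggregate columns."""
    amounts = [p["amount"] for p in purchases if p["user_id"] == user_id]
    features = dict(users[user_id])  # start from the single-table columns
    features["purchase_count"] = len(amounts)
    features["purchase_amount_sum"] = sum(amounts)
    # Fall back to 0.0 when the user has no related rows to aggregate.
    features["purchase_amount_mean"] = mean(amounts) if amounts else 0.0
    return features


# One flat feature row per user, ready for any single-table model.
flat_table = {uid: late_fusion_features(uid) for uid in users}
```

A single-table baseline would use only the age column here. The aggregate columns are what the relational structure adds, and they are fixed before training rather than learned.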

Theorems & Definitions (2)

  • Definition 1
  • Proposition 1