WikiDBGraph: A Data Management Benchmark Suite for Collaborative Learning over Database Silos

Zhaomin Wu, Ziyang Wang, Bingsheng He

TL;DR

WikiDBGraph addresses the gap between theory and practice in collaborative learning (CL) across data silos by building a large-scale graph of 100,000 relational databases linked through 17 million weighted edges. It learns correlations between databases with a contrastive embedding model, builds a graph annotated with 13 node properties and 12 edge properties, and provides an automated data-mining pipeline for evaluating CL methods end-to-end. The study shows that real-world databases are highly interconnected, that their overlap is hybrid (a mix of instance-level and feature-level), and that full table joins are often infeasible. As a result, CL yields mixed gains, and preprocessing emerges as a critical bottleneck. The dataset and accompanying benchmark suite enable targeted research into schema matching, instance alignment, and graph-aware CL, with practical implications for deploying privacy-preserving cross-database learning in industry settings.

Abstract

Relational databases are often fragmented across organizations, creating data silos that hinder distributed data management and mining. Collaborative learning (CL), a family of techniques that enable multiple parties to train models jointly without sharing raw data, offers a principled approach to this challenge. However, existing CL frameworks (e.g., federated and split learning) remain limited in real-world deployments. Current CL benchmarks and algorithms primarily target the learning step under assumptions of isolated, aligned, and joinable databases, and they typically neglect the end-to-end data management pipeline, especially preprocessing steps such as table joins and data alignment. In contrast, our analysis of the real-world corpus WikiDBs shows that databases are interconnected, unaligned, and sometimes unjoinable, exposing a significant gap between CL algorithm design and practical deployment. To close this evaluation gap, we build WikiDBGraph, a large-scale dataset constructed from 100,000 real-world relational databases linked by 17 million weighted edges. Each node (database) and edge (relationship) is annotated with 13 and 12 properties, respectively, capturing a hybrid of instance- and feature-level overlap across databases. Experiments on WikiDBGraph demonstrate both the effectiveness and limitations of existing CL methods under realistic conditions, highlighting previously overlooked gaps in managing real-world data silos and pointing to concrete directions for practical deployment of collaborative learning systems.

Paper Structure

This paper contains 38 sections, 4 equations, 7 figures, 12 tables, 1 algorithm.

Figures (7)

  • Figure 1: An example of collaborative learning
  • Figure 2: The overview of WikiDBGraph construction process
  • Figure 3: Performance evaluation of the embedding model
  • Figure 4: Distribution of Graph Properties in WikiDBGraph
  • Figure 5: Performance gain distributions ($\Delta$) of different CL algorithms over the Solo baseline across 2,000 database pairs
  • ...and 2 more figures