Relational Database Distillation: From Structured Tables to Condensed Graph Data
Xinyi Gao, Jingxi Zhang, Lijian Chen, Tong Chen, Lizhen Cui, Hongzhi Yin
TL;DR
Relational databases (RDBs) store data across interdependent tables, but graph-based learning on them is expensive: multi-hop message passing over large tables drives up both training time and storage. The authors introduce the problem of Relational Database Distillation (RDD) and the Table-to-Graph (T2G) framework, which distills large RDBs into compact heterogeneous graphs while preserving predictive utility. T2G combines modality-specific tokenizers for multi-modal attributes, a clustering-based pretraining stage that generates pseudo-labels, a heterogeneous stochastic block model (SBM) that generates the distilled graph, and a kernel ridge regression (KRR) distillation objective that transfers predictive knowledge. Experiments on real-world RDBs show substantial reductions in data size with competitive performance on classification and regression tasks, enabling scalable learning on large relational databases.
Abstract
Relational databases (RDBs) underpin the majority of global data management systems, where information is structured into multiple interdependent tables. To effectively use the knowledge within RDBs for predictive tasks, recent advances leverage graph representation learning to capture complex inter-table relations as multi-hop dependencies. Despite achieving state-of-the-art performance, these methods remain hindered by prohibitive storage overhead and excessive training time, owing to the massive scale of the databases and the computational burden of intensive message passing across interconnected tables. To alleviate these concerns, we propose and study the problem of Relational Database Distillation (RDD). Specifically, we aim to distill large-scale RDBs into compact heterogeneous graphs while retaining the predictive power (i.e., utility) required for training graph-based models. Multi-modal column information is preserved through node features, and primary-foreign key relations are encoded via heterogeneous edges, thereby maintaining both data fidelity and relational structure. To ensure adaptability across diverse downstream tasks without resorting to the traditional, inefficient bi-level distillation framework, we further design a kernel ridge regression-guided objective with pseudo-labels, which produces high-quality features for the distilled graph. Extensive experiments on multiple real-world RDBs demonstrate that our solution substantially reduces the data size while maintaining competitive performance on classification and regression tasks, creating an effective pathway for scalable learning with RDBs.
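To make the kernel ridge regression-guided objective more concrete, the following is a minimal sketch of how a generic KRR distillation loss can be evaluated in closed form, which is what lets it replace a bi-level inner training loop. It is not the authors' implementation: the RBF kernel, the function names `rbf_kernel` and `krr_distillation_loss`, the parameters `lam` and `gamma`, and the random stand-ins for embeddings and pseudo-labels are all assumptions made for illustration. In the paper's setting the pseudo-labels would come from the clustering-based pretraining stage rather than from a random generator.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clamp tiny negative round-off

def krr_distillation_loss(X_syn, Y_syn, X_real, Y_real, lam=1e-3, gamma=0.1):
    """Fit KRR on the small synthetic set, then score it on the real set.

    The fit is a single linear solve, so no inner optimization loop is needed.
    """
    K_ss = rbf_kernel(X_syn, X_syn, gamma)   # synthetic vs. synthetic
    K_rs = rbf_kernel(X_real, X_syn, gamma)  # real vs. synthetic
    alpha = np.linalg.solve(K_ss + lam * np.eye(len(X_syn)), Y_syn)
    pred = K_rs @ alpha                      # KRR predictions for real nodes
    return 0.5 * np.mean((pred - Y_real) ** 2)

# Hypothetical data: 200 "real" node embeddings with 3 pseudo-label classes,
# and 6 synthetic nodes (2 per class). All values are random placeholders.
rng = np.random.default_rng(0)
X_real = rng.normal(size=(200, 8))
Y_real = np.eye(3)[rng.integers(0, 3, size=200)]  # one-hot pseudo-labels
X_syn = rng.normal(size=(6, 8))
Y_syn = np.eye(3)[np.arange(6) % 3]

loss = krr_distillation_loss(X_syn, Y_syn, X_real, Y_real)
print(f"KRR distillation loss: {loss:.4f}")
```

In a full distillation loop, `X_syn` would be treated as learnable and updated by gradient descent on this loss using an automatic differentiation framework; the sketch only evaluates the objective once.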
