PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models
Vignesh Kothapalli, Rishabh Ranjan, Valter Hudovernik, Vijay Prakash Dwivedi, Johannes Hoffart, Carlos Guestrin, Jure Leskovec
TL;DR
PluRel addresses the scarcity of public, diverse relational data for training Relational Foundation Models (RFMs) by introducing a three-stage synthetic data generator for multi-table databases. It builds schemas as directed graphs, models inter-table primary-foreign key connectivity via hierarchical stochastic block models, and generates table rows with table-specific Structural Causal Models that incorporate temporal dynamics. The authors demonstrate power-law scaling of pretraining loss with both the number of synthetic databases $N$ and the total number of pretraining tokens $S$, and show that larger, more diverse synthetic data improves zero-shot generalization to RelBench and improves downstream performance when combined with continued pretraining on real data. These results suggest that scaling synthetic data can enable scalable, privacy-preserving pretraining for RFMs and broaden access to diverse relational training data.
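To make the power-law claim concrete, the following is a minimal sketch of how an exponent could be estimated from (number of databases, loss) pairs. It assumes the simplest form, $L(N) \approx a N^{-b}$, with no irreducible-loss term; the paper's exact functional form may differ. The function name `fit_power_law` and all numeric values are hypothetical and are not results from the paper.

```python
import numpy as np


def fit_power_law(x, loss):
    """Fit loss ~ a * x**(-b) by least squares in log-log space; return (a, b)."""
    x = np.asarray(x, dtype=float)
    loss = np.asarray(loss, dtype=float)
    # A power law is a straight line in log-log space: log L = log a - b * log x.
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)


# Made-up illustrative values, NOT measurements from the paper.
num_databases = np.array([8, 16, 32, 64, 128, 256], dtype=float)
true_a, true_b = 2.0, 0.25
toy_loss = true_a * num_databases ** (-true_b)

a_hat, b_hat = fit_power_law(num_databases, toy_loss)
```

The same fit applies with total pretraining tokens $S$ in place of $N$. On real, noisy loss curves a log-log fit is only a first check; a nonzero loss floor would require fitting an additional offset term.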
Abstract
Relational Foundation Models (RFMs) facilitate data-driven decision-making by learning from complex multi-table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints. While there are methods to generate synthetic tabular data of arbitrary size, incorporating schema structure and primary-foreign key connectivity for multi-table generation remains challenging. Here we introduce PluRel, a framework to synthesize multi-table relational databases from scratch. In a step-by-step fashion, PluRel models (1) schemas with directed graphs, (2) inter-table primary-foreign key connectivity with bipartite graphs, and (3) feature distributions in tables via conditional causal mechanisms. The design space across these stages supports the synthesis of a wide range of diverse databases while remaining computationally lightweight. Using PluRel, we observe for the first time that (1) RFM pretraining loss exhibits power-law scaling with the number of synthetic databases and total pretraining tokens, (2) scaling the number of synthetic databases improves generalization to real databases, and (3) synthetic pretraining yields strong base models for continued pretraining on real databases. Overall, our framework and results position synthetic data scaling as a promising paradigm for RFMs.
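The three stages can be illustrated with a toy sketch. This is not PluRel's implementation: every function name, parameter, and distribution below is a hypothetical stand-in. Stage 2 uses a flat block structure rather than the framework's own connectivity model, stage 3 uses a small tanh-linear causal chain, and temporal dynamics are omitted.

```python
import numpy as np


def sample_schema(num_tables, rng, max_parents=2):
    """Stage 1 (toy): schema as a DAG; fks[child] lists the parent tables it references."""
    fks = {0: []}
    for child in range(1, num_tables):
        k = int(rng.integers(1, min(max_parents, child) + 1))
        # Parents are drawn from lower table ids only, so the graph is acyclic.
        parents = rng.choice(child, size=k, replace=False)
        fks[child] = sorted(int(p) for p in parents)
    return fks


def sample_fk_column(num_child, num_parent, rng, num_blocks=3, p_in=0.9):
    """Stage 2 (toy): bipartite child-row -> parent-row links with block structure."""
    parent_block = rng.integers(0, num_blocks, size=num_parent)
    child_block = rng.integers(0, num_blocks, size=num_child)
    fk = np.empty(num_child, dtype=int)
    for i in range(num_child):
        same = np.flatnonzero(parent_block == child_block[i])
        if same.size > 0 and rng.random() < p_in:
            fk[i] = rng.choice(same)  # link within the child's block
        else:
            fk[i] = rng.integers(0, num_parent)  # occasional link to any parent row
    return fk


def sample_features(num_rows, num_features, parent_signal, rng):
    """Stage 3 (toy): feature j depends on features i < j and on the linked parent rows."""
    # Strictly upper-triangular weights: W[i, j] is the effect of feature i on feature j.
    W = np.triu(rng.normal(size=(num_features, num_features)), k=1)
    X = np.zeros((num_rows, num_features))
    for j in range(num_features):
        noise = rng.normal(size=num_rows)
        X[:, j] = np.tanh(X @ W[:, j] + parent_signal) + 0.1 * noise
    return X


def generate_database(num_tables=4, rows_per_table=50, num_features=3, seed=0):
    rng = np.random.default_rng(seed)
    schema = sample_schema(num_tables, rng)
    tables = {}
    for t in range(num_tables):  # parents have smaller ids, so they are already built
        signal = np.zeros(rows_per_table)
        fk_cols = {}
        for p in schema[t]:
            fk = sample_fk_column(rows_per_table, rows_per_table, rng)
            fk_cols[p] = fk
            # Condition child features on the first feature of each linked parent row.
            signal += tables[p]["features"][fk, 0]
        features = sample_features(rows_per_table, num_features, signal, rng)
        tables[t] = {"fks": fk_cols, "features": features}
    return schema, tables


schema, tables = generate_database()
```

Changing the seed, the number of tables, the block parameters, or the causal weights yields a different database, which is the sense in which a small design space can produce many distinct databases at low cost.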
