PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models

Vignesh Kothapalli, Rishabh Ranjan, Valter Hudovernik, Vijay Prakash Dwivedi, Johannes Hoffart, Carlos Guestrin, Jure Leskovec

TL;DR

PluRel addresses the scarcity of public, diverse relational data for training Relational Foundation Models (RFMs) by introducing a three-stage synthetic data generator for multi-table databases. It builds schemas with directed graphs, models inter-table primary-foreign key (P→F) connectivity via hierarchical stochastic block models, and generates table rows with table-specific Structural Causal Models (SCMs) that incorporate temporal dynamics. The authors demonstrate power-law scaling of pretraining loss with both the number of synthetic databases $N$ and the total number of pretraining tokens $S$, and show that larger, more diverse synthetic data improves zero-shot generalization to RelBench and enhances downstream performance when combined with continued pretraining on real data. This work suggests that synthetic data scaling can enable scalable, privacy-preserving pretraining for RFMs, broadening access to diverse relational data.
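To make the block-model idea concrete, here is a minimal sketch of sampling P→F links between one parent table and one child table. It is not the authors' implementation: it uses a single-level stochastic block model rather than a hierarchical one, and the function name `sbm_foreign_keys` and the parameters `num_blocks` and `p_in` are hypothetical choices made for illustration.

```python
import random

def sbm_foreign_keys(num_parent_rows, num_child_rows, num_blocks, p_in, rng):
    """Assign each child row one parent row with a single-level block model.

    Assumption (not from the paper): a child row links to a parent row in its
    own block with probability p_in, otherwise to a row in another block.
    """
    if num_parent_rows < num_blocks:
        raise ValueError("need at least one parent row per block")

    # Round-robin block assignment guarantees every block has a parent row.
    parent_block = [row % num_blocks for row in range(num_parent_rows)]
    members = [[] for _ in range(num_blocks)]
    for row, block in enumerate(parent_block):
        members[block].append(row)

    # Child rows get a uniformly random block.
    child_block = [rng.randrange(num_blocks) for _ in range(num_child_rows)]

    foreign_keys = []
    for block in child_block:
        if num_blocks == 1 or rng.random() < p_in:
            target = block  # within-block link
        else:
            others = [b for b in range(num_blocks) if b != block]
            target = rng.choice(others)  # cross-block link
        foreign_keys.append(rng.choice(members[target]))
    return parent_block, child_block, foreign_keys

rng = random.Random(0)
parent_block, child_block, fk = sbm_foreign_keys(20, 200, 4, 0.9, rng)
```

Each entry of `fk` is the index of the parent row referenced by the corresponding child row, so the list can be read directly as a foreign key column. Raising `p_in` concentrates links within blocks, which is one simple way to vary connectivity patterns across generated databases.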

Abstract

Relational Foundation Models (RFMs) facilitate data-driven decision-making by learning from complex multi-table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints. While there are methods to generate synthetic tabular data of arbitrary size, incorporating schema structure and primary-foreign key connectivity for multi-table generation remains challenging. Here we introduce PluRel, a framework to synthesize multi-tabular relational databases from scratch. In a step-by-step fashion, PluRel models (1) schemas with directed graphs, (2) inter-table primary-foreign key connectivity with bipartite graphs, and (3) feature distributions in tables via conditional causal mechanisms. The design space across these stages supports the synthesis of a wide range of diverse databases, while being computationally lightweight. Using PluRel, we observe for the first time that (1) RFM pretraining loss exhibits power-law scaling with the number of synthetic databases and total pretraining tokens, (2) scaling the number of synthetic databases improves generalization to real databases, and (3) synthetic pretraining yields strong base models for continued pretraining on real databases. Overall, our framework and results position synthetic data scaling as a promising paradigm for RFMs.
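The three stages described above can be sketched end to end in a few lines. This is an illustrative toy under stated assumptions, not the PluRel code: every function name is hypothetical, foreign keys are drawn uniformly instead of from the paper's bipartite graph model, each table has a single numeric feature, the causal mechanism is a fixed linear map plus Gaussian noise, and temporal patterns are omitted.

```python
import random
from graphlib import TopologicalSorter

def sample_schema(num_tables, edge_prob, rng):
    """Stage 1 (toy): sample a DAG over tables as {child: [parent tables]}.

    Only lower-numbered tables can be parents, so the graph is acyclic.
    """
    parents = {t: [] for t in range(num_tables)}
    for child in range(num_tables):
        for parent in range(child):
            if rng.random() < edge_prob:
                parents[child].append(parent)
    return parents

def sample_foreign_keys(parents, num_rows, rng):
    """Stage 2 (toy): link every child row to one uniformly chosen parent row."""
    fks = {}
    for child, parent_list in parents.items():
        for parent in parent_list:
            fks[(child, parent)] = [
                rng.randrange(num_rows[parent]) for _ in range(num_rows[child])
            ]
    return fks

def generate_features(parents, fks, num_rows, rng):
    """Stage 3 (toy): fill one feature per row, visiting parents before children.

    Assumed mechanism: feature = 0.5 * (sum of linked parent features) + noise.
    """
    order = list(TopologicalSorter(parents).static_order())
    features = {}
    for table in order:
        column = []
        for row in range(num_rows[table]):
            parent_signal = sum(
                features[p][fks[(table, p)][row]] for p in parents[table]
            )
            column.append(0.5 * parent_signal + rng.gauss(0.0, 1.0))
        features[table] = column
    return order, features

rng = random.Random(0)
parents = sample_schema(num_tables=5, edge_prob=0.5, rng=rng)
num_rows = {t: 10 for t in parents}
fks = sample_foreign_keys(parents, num_rows, rng)
order, features = generate_features(parents, fks, num_rows, rng)
```

The topological ordering is what lets a child table's features depend on its parents: by the time a child row is generated, the parent rows it references through its foreign keys already have values.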

Paper Structure (28 sections, 11 equations, 5 figures, 6 tables, 3 algorithms)

Figures (5)

  • Figure 1: (Left) Pretraining loss $L$ scales as a power law with both (1) the number of synthetic databases $N$ and (2) the pretraining dataset size $S$, when not bottlenecked by the other. See Section \ref{subsec:synthetic_scaling} for details. (Right) On real-world predictive tasks, PluRel-based synthetic pretraining followed by continued pretraining on real data outperforms real data pretraining alone. See Section \ref{subsec:synthetic_zs} for details.
  • Figure 2: The PluRel framework. Stage 1 generates a schema by sampling a directed graph $\mathcal{G}$ and populating the metadata with row and column counts. In Stage 2, the foreign key columns are populated using a bipartite graph between rows of parent-child table pairs, each edge representing a primary-foreign key (P$\to$F) link. In Stage 3, we follow a topological ordering of tables in $\mathcal{G}$ and leverage Structural Causal Models (SCMs) conditioned on parent tables, with temporal patterns in source node inputs, to populate the feature columns.
  • Figure 3: Synthesizing RDBs with PluRel results in diverse data distributions across feature column values.
  • Figure 4: Validation loss and zero-shot performance on RelBench tasks. The synthetic pretraining dataset sizes (in billions of tokens) are varied along with the number of PluRel RDBs to obtain the scaling curves. $(\downarrow)$/$(\uparrow)$ indicates that lower/higher values are better.
  • Figure 5: QK-Norm mitigates early overfitting with leave-one-db-out pretraining during the baseline runs and also improves the peak performance. AUROC $(\%)$ on the val/test splits of the rel-stack/user-engagement (a, b) and rel-stack/user-badge (c, d) tasks highlights the mitigation of overfitting. $R^2$ $(\%)$ on the val/test splits of the rel-stack/post-votes (e, f) and rel-f1/driver-position (g, h) tasks shows improvements to peak performance.
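
As a rough illustration of how a power-law trend like the one in Figure 1 can be estimated, the sketch below fits $L = a \cdot N^{-\alpha}$ by least squares in log-log space. The functional form (a pure power law with no irreducible-loss offset) is an assumption, since this summary does not state the paper's exact parameterization, and the data points are generated from a known law purely for demonstration; they are not measurements from the paper.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-alpha) by ordinary least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    var = sum((u - mean_x) ** 2 for u in log_x)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    # log y = log a - alpha * log x, so a = exp(intercept) and alpha = -slope.
    return math.exp(intercept), -slope

# Illustrative points drawn from a known law (a = 3.0, alpha = 0.25).
num_databases = [2 ** k for k in range(3, 11)]
losses = [3.0 * n ** -0.25 for n in num_databases]
a, alpha = fit_power_law(num_databases, losses)
```

On a log-log plot a power law appears as a straight line with slope $-\alpha$, which is why the fit reduces to linear regression. Real loss curves typically flatten once the other resource becomes the bottleneck, so such a fit only applies to the regime where the trend is linear.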

Theorems & Definitions (3)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3