Relational In-Context Learning via Synthetic Pre-training with Structural Prior

Yanbo Wang, Jiaxuan You, Chuan Shi, Muhan Zhang

TL;DR

Inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, the authors introduce RDB-PFN, the first relational foundation model trained purely on synthetic data. A Relational Prior Generator creates an infinite stream of diverse RDBs from scratch for pre-training.

Abstract

Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high-quality RDBs are private, scarce, and structurally heterogeneous, making internet-scale pre-training infeasible. To overcome this data scarcity, we introduce $\textbf{RDB-PFN}$, the first relational foundation model trained purely via $\textbf{synthetic data}$. Inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a $\textbf{Relational Prior Generator}$ to create an infinite stream of diverse RDBs from scratch. Pre-trained on $\textbf{over 2 million}$ synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database instantly via genuine $\textbf{in-context learning}$. Experiments verify that RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming graph-based and single-table foundation-model baselines (given the same DFS-linearized inputs), while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN.
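
The PFN-style recipe mentioned in the abstract (draw a random SCM, sample a table from it, then split the rows into a labelled context set and a query set) can be sketched in a few lines. This is a minimal single-table illustration under stated assumptions, not the paper's Relational Prior Generator: the function names `sample_scm_table` and `make_task`, the tanh mechanism, the 0.5 edge probability, and the noise scale are all hypothetical choices made for the example.

```python
import math
import random


def sample_scm_table(n_rows, n_cols, seed=0):
    """Sample one synthetic table from a randomly drawn SCM (illustrative only)."""
    rng = random.Random(seed)
    # Random DAG: column j may depend only on columns i < j, so the column
    # order is already a valid topological order.
    parents = {j: [i for i in range(j) if rng.random() < 0.5] for j in range(n_cols)}
    # One random edge weight per (parent, child) pair.
    weights = {(i, j): rng.gauss(0.0, 1.0) for j in parents for i in parents[j]}
    rows = []
    for _ in range(n_rows):
        row = []
        for j in range(n_cols):
            noise = rng.gauss(0.0, 1.0)
            if parents[j]:
                signal = sum(weights[(i, j)] * row[i] for i in parents[j])
                row.append(math.tanh(signal) + 0.1 * noise)
            else:
                row.append(noise)  # root columns are pure noise
        rows.append(row)
    return parents, rows


def make_task(rows, target_col, n_context):
    """Split a table into an in-context (labelled) set and a query set."""
    X = [[v for k, v in enumerate(r) if k != target_col] for r in rows]
    y = [int(r[target_col] > 0.0) for r in rows]  # binarise the target column
    return (X[:n_context], y[:n_context]), (X[n_context:], y[n_context:])


parents, rows = sample_scm_table(n_rows=64, n_cols=5, seed=1)
(context_X, context_y), (query_X, query_y) = make_task(rows, target_col=4, n_context=48)
```

Each call with a new seed yields a new task; a PFN-style model would be trained to predict `query_y` from `query_X` given the context pairs, which is what lets it adapt at inference time without gradient updates.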

Paper Structure (65 sections, 9 theorems, 66 equations, 7 figures, 10 tables, 1 algorithm)

Key Result

Lemma 4.5

Let $(X,Y)$ be random variables with $Y$ taking values in a standard Borel space (e.g., $\mathbb{R}^d$, any countable set, any finite set, or a product of these). Then there exists a Borel measurable function $f$ and an independent $U\sim\mathrm{Unif}(0,1)$ such that … Moreover, if $\hat{f}_n$ is any sequence of measurable functions such that $\hat{f}_n(X,U)\to f(X,U)$ in probability under the join…
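
The lemma's title describes a conditional sampler: a deterministic map $f$ that turns $X$ and independent uniform noise $U$ into a draw of $Y$. The displayed conclusion is not recoverable from this listing, so the sketch below assumes the familiar reading that $f(X,U)$ follows the conditional law of $Y$ given $X$. For a finite label set, one concrete choice of $f$ is the inverse conditional CDF. The table `COND` and its probabilities are hypothetical values chosen for illustration.

```python
import random
from bisect import bisect_right
from itertools import accumulate

# Hypothetical conditional law P(Y | X = x) on a finite label set.
COND = {
    0: {"a": 0.7, "b": 0.2, "c": 0.1},
    1: {"a": 0.1, "b": 0.3, "c": 0.6},
}


def f(x, u):
    """Inverse conditional CDF of Y given X = x, evaluated at u in [0, 1)."""
    labels = list(COND[x])
    cdf = list(accumulate(COND[x].values()))  # running sums of the probabilities
    idx = bisect_right(cdf, u)                # first label whose CDF value exceeds u
    return labels[min(idx, len(labels) - 1)]  # clamp against float round-off


# Empirical check: with U ~ Unif(0, 1), f(0, U) should return "a" about 70% of the time.
rng = random.Random(0)
n = 20000
freq_a = sum(f(0, rng.random()) == "a" for _ in range(n)) / n
```

All randomness enters through `u`, so `f` itself is an ordinary deterministic function. That is the property that makes the second half of the lemma meaningful: approximating $f$ by functions $\hat{f}_n$ can be discussed pointwise in $(X,U)$.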

Figures (7)

  • Figure 1: Overview of the RDB-PFN Framework. The top panel illustrates our Universal Relational Prior, which synthesizes diverse relational databases via a hierarchical decomposition: Schema (LayerDAG), Structure (Hybrid SCM), and Content (Hierarchical SCM). The bottom panel depicts the Two-Stage Curriculum Learning protocol, where the model first establishes a statistical backbone on single-table data before adapting to the complex topological signals of linearized relational data.
  • Figure 2: Resource Efficiency Frontier. Comparison of model complexity across Inference Latency (X-axis), Parameter Count (Y-axis), and Pre-training Data Volume (Bubble Size). Note that "Lite" baselines denote single-estimator configurations (ensembling disabled) to facilitate a direct architectural comparison. RDB-PFN (red star) dominates the efficiency landscape, achieving SOTA performance with 3x--8x faster inference, requiring only 2%--5% of the pre-training data, and using from under 2% to 20% of the parameters of competing foundation models.
  • Figure 3: Relational Few-Shot Performance across Evaluation Protocols. We report aggregated normalized performance across 19 relational tasks (higher is better). (a) Single-Estimator Protocol: all baselines are constrained to one estimator (ensembling disabled). RDB-PFN clearly surpasses all baselines. (b) Recommended-Default Protocol: baselines run with their official default inference pipelines (which may include test-time ensembling), while RDB-PFN remains a single forward-pass estimator. RDB-PFN maintains superior average performance while offering 3x--8x faster inference, positioning it on the optimal frontier of the efficiency-accuracy trade-off.
  • Figure 4: Single-Table Performance Analysis. We compare performance across (1) Classic ML Baselines, (2) Specialized Tabular Foundation Models, and (3) RDB-PFN Variants. While RDB-PFN slightly trails specialized single-table models (an expected trade-off given its broader structural scope), it consistently outperforms classical baselines. Crucially, the full RDB-PFN surpasses its own single-table-only variant. This confirms a distinct Positive Transfer effect: exposure to diverse relational structures enhances general tabular reasoning capabilities beyond what is achievable with single-table pretraining alone.
  • Figure 5: Visualizing Structural Correlations. Correlation heatmaps showing that while single-table data (Top Row) exhibits diffuse patterns, linearized RDBs (Bottom Row) display a distinct Block-Diagonal Structure. Our synthetic prior successfully reproduces this characteristic real-world topology.
  • ...and 2 more figures
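
The captions above refer to "linearized relational data", and the abstract to "DFS-linearized inputs". The exact procedure is not given in this listing. As a rough illustration, and assuming it resembles feature-synthesis-style aggregation of child rows into their parent row, a minimal sketch could look like this. The tables `users` and `orders`, the helper `linearize`, and the chosen aggregates are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy database: one parent table and one child table
# linked by the foreign key "user_id".
users = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 27},
    {"user_id": 3, "age": 45},
]
orders = [
    {"order_id": 10, "user_id": 1, "amount": 20.0},
    {"order_id": 11, "user_id": 1, "amount": 40.0},
    {"order_id": 12, "user_id": 2, "amount": 15.0},
]


def linearize(parent_rows, child_rows, key, value_col):
    """Flatten a one-to-many relation into one wide row per parent."""
    groups = defaultdict(list)
    for child in child_rows:
        groups[child[key]].append(child[value_col])
    flat = []
    for parent in parent_rows:
        values = groups.get(parent[key], [])
        flat.append({
            **parent,
            f"{value_col}_count": len(values),
            f"{value_col}_sum": sum(values),
            # Parents without children get a missing mean rather than 0.
            f"{value_col}_mean": sum(values) / len(values) if values else None,
        })
    return flat


flat_users = linearize(users, orders, key="user_id", value_col="amount")
```

The result is a single table with the parent's own columns followed by a group of columns derived from the child table. Column groups of this kind are one plausible source of the block structure that Figure 5 describes for linearized RDBs.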

Theorems & Definitions (26)

  • Definition 3.1: Relational Database: Schema and Instance
  • Definition 3.2: Source vs. Dependent Tables
  • Definition 3.3: Schema Graph and Instance Graph
  • Definition 3.4: Relation Types and Neighborhoods
  • Definition 3.5: Relational Prediction Task and In-Context Learning
  • Definition 4.1: Cell Instances and Values
  • Definition 4.2: Structural vs. Dependent Columns and Cell Instances
  • Definition 4.3: Structural Latent States
  • Lemma 4.5: Universal measurable conditional sampler + approximation transfer
  • Proof: Proof sketch
  • ...and 16 more