RDBLearn: Simple In-Context Prediction Over Relational Databases

Yanlin Zhang, Linjie Xu, Quan Gan, David Wipf, Minjie Wang

TL;DR

RDBLearn addresses prediction over multi-table relational databases within an in-context learning (ICL) framework. It proposes a simple two-stage recipe: 1) relational featurization, which deterministically aggregates information from an entity's relational neighborhood into fixed-size features $u=g(\mathcal{N}_{\mathrm{RDB}}(x))$, and 2) a tabular ICL backbone applied to the augmented row representation $z=[x;u]$ using a small labeled set $\mathcal{D}_{ICL}$. The authors implement this as an open-source toolkit, RDBLearn, with a scikit-learn-style interface, and report strong, robust performance on RelBench and 4DBInfer across classification and regression tasks, often rivaling or surpassing supervised baselines and other foundation-model-based approaches. A key finding is that simple relational featurization combined with a strong tabular ICL model can closely match more complex relational encoders while being substantially more efficient, because no gradient-based training is required. In practice, this enables scalable, reproducible relational prediction with minimal architectural overhead, and the paper outlines future directions in benchmarks, theory, and richer relational modeling within ICL. In notation, the database is $\mathrm{RDB}=(\mathcal{T},\mathcal{R})$, the relational neighborhood of a target row $x$ is $\mathcal{N}_{\mathrm{RDB}}(x)$, the augmented representation is $z=[x;g(\mathcal{N}_{\mathrm{RDB}}(x))]$, and the prediction is $\hat{y}_{\mathrm{test}}=f_\theta(z_{\mathrm{test}},\mathcal{D}_{ICL})$, where the context set $\mathcal{D}_{ICL}$ holds the augmented labeled rows and is therefore the channel through which the prediction is conditioned on $\mathrm{RDB}$.
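The two-stage recipe can be illustrated with a minimal, standard-library sketch. It is not the RDBLearn API: the tables, column names, and the helpers `featurize` and `knn_predict` are hypothetical, the aggregations (count, sum, mean) are only one plausible choice of $g$, and a 1-nearest-neighbor vote stands in for the pretrained tabular ICL backbone $f_\theta$ purely to show where the context set enters.

```python
from collections import defaultdict
from statistics import mean


def featurize(target_rows, child_rows, key, value_col):
    """Stage 1: compute u = g(N(x)) by aggregating linked child records,
    then return the augmented rows z = [x; u]."""
    groups = defaultdict(list)
    for rec in child_rows:
        groups[rec[key]].append(rec[value_col])
    augmented = []
    for row in target_rows:
        vals = groups.get(row[key], [])
        u = {
            f"{value_col}_count": len(vals),
            f"{value_col}_sum": sum(vals),
            f"{value_col}_mean": mean(vals) if vals else 0.0,
        }
        augmented.append({**row, **u})
    return augmented


def knn_predict(z_test, context, feature_cols, k=1):
    """Stage 2 stand-in for f_theta(z_test, D_ICL): majority vote over the
    k nearest labeled context rows. No gradient-based training is involved."""
    def dist(a, b):
        return sum((a[c] - b[c]) ** 2 for c in feature_cols)
    nearest = sorted(context, key=lambda pair: dist(pair[0], z_test))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


# Hypothetical toy database: a target table of users and a linked orders table.
users = [
    {"user_id": 1, "age": 30},
    {"user_id": 2, "age": 45},
    {"user_id": 3, "age": 28},
    {"user_id": 4, "age": 52},  # query row, label unknown
]
orders = [
    {"user_id": 1, "amount": 20.0},
    {"user_id": 1, "amount": 35.0},
    {"user_id": 3, "amount": 15.0},
    {"user_id": 3, "amount": 25.0},
    {"user_id": 3, "amount": 50.0},
]
churn_labels = {1: 0, 2: 1, 3: 0}  # made-up labels for the in-context examples

augmented = featurize(users, orders, key="user_id", value_col="amount")
feature_cols = ["age", "amount_count", "amount_sum", "amount_mean"]

# D_ICL: augmented labeled rows; the query is the remaining unlabeled row.
context = [(z, churn_labels[z["user_id"]]) for z in augmented
           if z["user_id"] in churn_labels]
query = next(z for z in augmented if z["user_id"] not in churn_labels)
prediction = knn_predict(query, context, feature_cols, k=1)
```

In this sketch the predictor never sees the orders table directly; everything it learns about the database arrives through the aggregated columns, which is why any single-table model can be swapped in for the second stage.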

Abstract

Recent advances in tabular in-context learning (ICL) show that a single pretrained model can adapt to new prediction tasks from a small set of labeled examples, avoiding per-task training and heavy tuning. However, many real-world tasks live in relational databases, where predictive signal is spread across multiple linked tables rather than a single flat table. We show that tabular ICL can be extended to relational prediction with a simple recipe: automatically featurize each target row using relational aggregations over its linked records, materialize the resulting augmented table, and run an off-the-shelf tabular foundation model on it. We package this approach in RDBLearn (https://github.com/HKUSHXLab/rdblearn), an easy-to-use toolkit with a scikit-learn-style estimator interface that makes it straightforward to swap different tabular ICL backends; a complementary agent-specific interface is provided as well. Across a broad collection of RelBench and 4DBInfer datasets, RDBLearn is the best-performing foundation model approach we evaluate, at times even outperforming strong supervised baselines trained or fine-tuned on each dataset.

Paper Structure (35 sections, 7 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: Relational featurization + single-table ICL workflow. Given a relational database (RDB) and a target table containing in-context examples and a query instance, relational featurization computes engineered features from each instance's relational neighborhood. These additional feature columns $u^{(1)}, u^{(2)}, \ldots$ are concatenated with the original input columns $x^{(1)}, x^{(2)}, \ldots$ to form an augmented table, and a single-table ICL predictor produces the final prediction $\hat{y}_{\mathrm{test}}$.
  • Figure 2: Concrete example of relational featurization. (a) A relational database (RDB) together with a target table for user churn, containing labeled in-context examples and a query instance. (b) The augmented target table after computing engineered features by aggregating linked records in the RDB, which can then be consumed by a single-table ICL model.
  • Figure 3: Minimal usage of RDBLearn on a RelBench task.
  • Figure 4: A use case of Claude Code evaluating a customized RDB prediction task on RDBLearn. Traces are compressed for clarity.
  • Figure 5: Main results on RelBench and 4DBInfer. Each row corresponds to a dataset family and task type (RelBench classification, RelBench regression, and 4DBInfer classification). Left: mean performance across tasks (AUC is shown in percentage points). Right: mean per-task rank across tasks (lower is better).
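
The "mean per-task rank" summary named in the Figure 5 caption can be sketched as follows. This is an illustration under assumptions, not the paper's evaluation code: the function `mean_rank`, the method and task names, and the scores are placeholders invented for the example, and ties are broken by sort order rather than averaged, which may differ from the authors' procedure.

```python
from collections import defaultdict


def mean_rank(scores, higher_is_better=True):
    """scores maps task -> {method: score}. Each method is ranked within
    every task (1 = best), and the ranks are averaged across tasks."""
    totals = defaultdict(float)
    for by_method in scores.values():
        ordered = sorted(by_method, key=by_method.get, reverse=higher_is_better)
        for rank, method in enumerate(ordered, start=1):
            totals[method] += rank
    n_tasks = len(scores)
    return {method: total / n_tasks for method, total in totals.items()}


# Placeholder scores (not results from the paper), e.g. AUC on two tasks.
toy_scores = {
    "task_1": {"method_a": 0.80, "method_b": 0.75, "method_c": 0.70},
    "task_2": {"method_a": 0.60, "method_b": 0.65, "method_c": 0.55},
}
ranks = mean_rank(toy_scores)
```

For a regression metric where lower is better, such as an error measure, the same helper would be called with `higher_is_better=False`. Averaging ranks rather than raw scores keeps a single task with an unusually large score range from dominating the summary.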