RDBLearn: Simple In-Context Prediction Over Relational Databases
Yanlin Zhang, Linjie Xu, Quan Gan, David Wipf, Minjie Wang
TL;DR
RDBLearn addresses prediction over multi-table relational databases within an in-context learning (ICL) framework. It proposes a simple two-stage recipe: 1) relational featurization, which deterministically aggregates information from an entity's relational neighborhood into fixed-size features $u=g(\mathcal{N}_{\mathrm{RDB}}(x))$, and 2) a tabular ICL backbone applied to the augmented row representation $z=[x;u]$ using a small labeled set $\mathcal{D}_{ICL}$. The authors implement this as an open-source toolkit, RDBLearn, with a scikit-learn-style interface. On RelBench and 4DBInfer, across classification and regression tasks, it is the best-performing foundation-model approach the authors evaluate and at times outperforms supervised baselines trained or fine-tuned on each dataset. A key finding is that simple relational featurization plus a strong tabular ICL model can closely match more complex relational encoders, while being substantially more efficient because no gradient-based training is required. In practice, this enables scalable, reproducible relational prediction with minimal architectural overhead, and the work outlines directions for future improvements in benchmarks, theory, and richer relational modeling within ICL. In notation: the database is $\mathrm{RDB}=(\mathcal{T},\mathcal{R})$, the relational neighborhood of a target row $x$ is $\mathcal{N}_{\mathrm{RDB}}(x)$, the augmented representation is $z=[x;g(\mathcal{N}_{\mathrm{RDB}}(x))]$, and predictions are given by $y_{test}=f_\theta(z_{test},\mathcal{D}_{ICL})$ (with extensions that condition on $\mathrm{RDB}$ via $\mathcal{D}_{ICL}$).
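To make the two-stage recipe concrete, the following is a minimal sketch in plain Python, not the RDBLearn implementation. The toy `users` and `orders` tables, the choice of aggregations (count, mean, max), and all function names are assumptions made for illustration. In particular, `predict_in_context` is a 1-nearest-neighbor stand-in for the pretrained tabular ICL model $f_\theta$; it is used only because it also predicts from a labeled context set without gradient-based training.

```python
from statistics import mean

# Hypothetical toy database: a target table (users) and one linked table (orders).
users = [
    {"user_id": 1, "age": 30},
    {"user_id": 2, "age": 45},
    {"user_id": 3, "age": 22},
]
orders = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 1, "amount": 30.0},
    {"user_id": 2, "amount": 5.0},
]


def neighborhood(user_id, child_rows, key="user_id"):
    """N_RDB(x): rows of the linked table that reference the target row."""
    return [r for r in child_rows if r[key] == user_id]


def featurize(linked_rows, column="amount"):
    """g(.): deterministic, fixed-size aggregation (count, mean, max)."""
    values = [r[column] for r in linked_rows]
    if not values:
        # Rows without linked records still get a feature vector of the same size.
        return [0.0, 0.0, 0.0]
    return [float(len(values)), mean(values), max(values)]


def augment(row, child_rows):
    """z = [x; u], where x holds the row's own features and u = g(N_RDB(x))."""
    x = [float(row["age"])]
    u = featurize(neighborhood(row["user_id"], child_rows))
    return x + u


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def predict_in_context(z_test, d_icl):
    """Stand-in for f_theta(z_test, D_ICL): return the label of the nearest context row."""
    nearest = min(d_icl, key=lambda pair: squared_distance(z_test, pair[0]))
    return nearest[1]


# Stage 1: materialize the augmented table.
z = [augment(row, orders) for row in users]

# Stage 2: predict for user 3 from a small labeled context set (labels are made up).
d_icl = [(z[0], "active"), (z[1], "inactive")]
prediction = predict_in_context(z[2], d_icl)
```

The point of the sketch is the separation of concerns: the featurizer is fixed and deterministic, so the predictor only ever sees a flat table and can be replaced without touching the relational logic.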
Abstract
Recent advances in tabular in-context learning (ICL) show that a single pretrained model can adapt to new prediction tasks from a small set of labeled examples, avoiding per-task training and heavy tuning. However, many real-world tasks live in relational databases, where predictive signal is spread across multiple linked tables rather than a single flat table. We show that tabular ICL can be extended to relational prediction with a simple recipe: automatically featurize each target row using relational aggregations over its linked records, materialize the resulting augmented table, and run an off-the-shelf tabular foundation model on it. We package this approach in RDBLearn (https://github.com/HKUSHXLab/rdblearn), an easy-to-use toolkit with a scikit-learn-style estimator interface that makes it straightforward to swap different tabular ICL backends; a complementary agent-specific interface is provided as well. Across a broad collection of RelBench and 4DBInfer datasets, RDBLearn is the best-performing foundation model approach we evaluate, at times even outperforming strong supervised baselines trained or fine-tuned on each dataset.
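As an illustration of what a scikit-learn-style estimator with swappable backends could look like, here is a small hypothetical sketch. The class `RelationalICLEstimator`, its constructor arguments, and both toy backends are invented for this example and are not the RDBLearn API; consult the repository for the actual interface. The sketch assumes the relational features have already been materialized onto each row.

```python
class RelationalICLEstimator:
    """Hypothetical fit/predict wrapper; not the RDBLearn API."""

    def __init__(self, featurizer, backend):
        self.featurizer = featurizer  # maps a target row to a fixed-size feature list
        self.backend = backend        # callable: (z_test, context) -> prediction

    def fit(self, rows, labels):
        # No gradient updates: "fitting" only stores the labeled context set.
        self.context_ = [(self.featurizer(r), y) for r, y in zip(rows, labels)]
        return self

    def predict(self, rows):
        return [self.backend(self.featurizer(r), self.context_) for r in rows]


def nearest_label(z_test, context):
    """Toy backend: label of the closest context row (squared Euclidean distance)."""
    def dist(z):
        return sum((a - b) ** 2 for a, b in zip(z_test, z))
    return min(context, key=lambda pair: dist(pair[0]))[1]


def mean_label(z_test, context):
    """Toy backend: ignores the features and returns the mean context label."""
    return sum(y for _, y in context) / len(context)


def to_features(row):
    # Assumes the aggregated column n_orders was produced by a featurization step.
    return [float(row["age"]), float(row["n_orders"])]


train_rows = [{"age": 30, "n_orders": 2}, {"age": 45, "n_orders": 1}]
train_labels = [1.0, 0.0]
test_rows = [{"age": 32, "n_orders": 2}]

# Swapping the backend changes the predictor but not the calling code.
pred_nearest = RelationalICLEstimator(to_features, nearest_label).fit(train_rows, train_labels).predict(test_rows)
pred_mean = RelationalICLEstimator(to_features, mean_label).fit(train_rows, train_labels).predict(test_rows)
```

Passing the backend as a plain callable keeps the estimator indifferent to which model produces the prediction, which is the property the abstract attributes to the toolkit's interface.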
