RelBench v2: A Large-Scale Benchmark and Repository for Relational Data

Justin Gu; Rishabh Ranjan; Charilaos Kanatsoulis; Haiming Tang; Martin Jurkovic; Valter Hudovernik; Mark Znidar; Pranshu Chaturvedi; Parth Shroff; Fengyu Li; Jure Leskovec

TL;DR

Experimental results demonstrate that RDL models consistently outperform single-table baselines across autocomplete, forecasting, and recommendation tasks, highlighting the importance of modeling relational structure explicitly.

Abstract

Relational deep learning (RDL) has emerged as a powerful paradigm for learning directly on relational databases by modeling entities and their relationships across multiple interconnected tables. As this paradigm evolves toward larger models and relational foundation models, scalable and realistic benchmarks are essential for enabling systematic evaluation and progress. In this paper, we introduce RelBench v2, a major expansion of the RelBench benchmark for RDL. RelBench v2 adds four large-scale relational datasets spanning scholarly publications, enterprise resource planning, consumer platforms, and clinical records, increasing the benchmark to 11 datasets comprising over 22 million rows across 29 tables. We further introduce autocomplete tasks, a new class of predictive objectives that require models to infer missing attribute values directly within relational tables while respecting temporal constraints, expanding beyond traditional forecasting tasks constructed via SQL queries. In addition, RelBench v2 expands beyond its native datasets by integrating external benchmarks and evaluation frameworks: we translate event streams from the Temporal Graph Benchmark into relational schemas for unified relational-temporal evaluation, interface with ReDeLEx to provide uniform access to 70+ real-world databases suitable for pretraining, and incorporate 4DBInfer datasets and tasks to broaden multi-table prediction coverage. Experimental results demonstrate that RDL models consistently outperform single-table baselines across autocomplete, forecasting, and recommendation tasks, highlighting the importance of modeling relational structure explicitly.
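The translation of event streams into relational schemas mentioned above can be illustrated with a minimal sketch. It is not the RelBench or Temporal Graph Benchmark code: the `Event` type, the `events_to_tables` helper, and all table and column names are hypothetical, and the example assumes each event is a timestamped (source, destination, feature) record. The idea is that each endpoint type becomes an entity table with a primary key, and each event becomes a row in a fact table carrying two foreign keys and a time column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One timestamped interaction in an event stream (hypothetical layout)."""
    src: str
    dst: str
    ts: int
    weight: float


def events_to_tables(events):
    """Split an event stream into two entity tables and one timestamped fact table."""
    src_ids, dst_ids = {}, {}
    interactions = []
    # Process events in time order so fact-table rows are chronologically sorted.
    for ev in sorted(events, key=lambda e: e.ts):
        # Assign a fresh integer key the first time an endpoint is seen.
        s = src_ids.setdefault(ev.src, len(src_ids))
        d = dst_ids.setdefault(ev.dst, len(dst_ids))
        interactions.append({
            "event_id": len(interactions),  # primary key of the fact table
            "src_id": s,                    # foreign key into the src table
            "dst_id": d,                    # foreign key into the dst table
            "ts": ev.ts,                    # time column used for temporal splits
            "weight": ev.weight,
        })
    src_table = [{"src_id": i, "name": n} for n, i in src_ids.items()]
    dst_table = [{"dst_id": i, "name": n} for n, i in dst_ids.items()]
    return {"src": src_table, "dst": dst_table, "interaction": interactions}
```

Keeping the timestamp on the fact table is what would allow the same temporal train/validation/test splitting to be applied to the translated data as to native relational datasets.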

Paper Structure (36 sections, 5 figures, 25 tables, 1 algorithm)

Introduction
Overview and Design
RelBench Datasets
rel-arxiv
rel-salt
rel-ratebeer
rel-mimic
Autocomplete tasks
Autocomplete classification
Autocomplete regression
New forecasting tasks
Entity Classification
Entity Regression
Recommendation
Integrating External Benchmarks into RelBench
...and 21 more sections

Figures (5)

Figure 1: RelBench schema of the newly added Sales Autocompletion Linked Business Tables (SALT) dataset [sap-salt].
Figure 2: RelBench schema of the newly added arXiv-physics dataset [arxiv_physics_dataset].
Figure 3: RelBench schema of the newly added RateBeer dataset.
Figure 4: RelBench schema of the newly added MIMIC-IV v3.1 dataset [johnson2024mimic].
Figure 5: Illustrative example of a real-world autocomplete task, where the SAP S/4HANA Sales Order user interface [sap-salt] predicts payment terms based on other filled-in response fields.
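
The autocomplete setting described in the Figure 5 caption can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the RelBench task definition: the helper names, the `created` and `payment_terms` columns, and the single time cutoff are all hypothetical. Rows created before the cutoff keep their target value for training, rows at or after the cutoff have the target column hidden, and a majority-class predictor stands in for a real model.

```python
from collections import Counter


def make_autocomplete_split(rows, target, time_col, cutoff):
    """Keep labelled rows before the cutoff; mask the target on later rows."""
    train = [r for r in rows if r[time_col] < cutoff]
    test_inputs, test_labels = [], []
    for r in rows:
        if r[time_col] >= cutoff:
            # Hide the target attribute so a model must infer it from other fields.
            test_inputs.append({k: v for k, v in r.items() if k != target})
            test_labels.append(r[target])
    return train, test_inputs, test_labels


def majority_baseline(train, target):
    """Return the most frequent target value seen in the training rows."""
    return Counter(r[target] for r in train).most_common(1)[0][0]


# Hypothetical sales-order rows; `created` is an integer timestamp.
orders = [
    {"order_id": 1, "created": 1, "customer": "A", "payment_terms": "NET30"},
    {"order_id": 2, "created": 2, "customer": "B", "payment_terms": "NET30"},
    {"order_id": 3, "created": 3, "customer": "A", "payment_terms": "NET60"},
    {"order_id": 4, "created": 5, "customer": "B", "payment_terms": "NET30"},
]
train, test_inputs, test_labels = make_autocomplete_split(
    orders, target="payment_terms", time_col="created", cutoff=4
)
prediction = majority_baseline(train, "payment_terms")
```

Splitting by time rather than at random is what keeps later rows from leaking into training, which is the temporal constraint the abstract refers to.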
