Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data

Rishabh Ranjan, Valter Hudovernik, Mark Znidar, Charilaos Kanatsoulis, Roshan Upendra, Mahmoud Mohammadi, Joe Meyer, Tom Palczewski, Carlos Guestrin, Jure Leskovec

TL;DR

The Relational Transformer tackles the lack of foundation-model capabilities for relational databases by introducing cell-level tokenization, task-table integration, and Relational Attention that explicitly encodes column, row, and foreign-key structure. Pretraining on RelBench enables strong zero-shot transfer to unseen datasets and tasks, with RT achieving about 93% of fully supervised AUROC on binary classification using 22M parameters and exhibiting superior data efficiency during fine-tuning. The approach outperforms larger LLM baselines under the same input regime and demonstrates robust schema-agnostic generalization across diverse relational schemas. This work provides a practical, scalable path toward foundation models for relational data, with significant implications for enterprise predictive analytics and rapid deployment across heterogeneous databases.

Abstract

Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks. The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures and functional dependencies. In this paper, we present the Relational Transformer (RT) architecture, which can be pretrained on diverse relational databases and directly applied to unseen datasets and tasks without task- or dataset-specific fine-tuning, or retrieval of in-context examples. RT (i) tokenizes cells with table/column metadata, (ii) is pretrained via masked token prediction, and (iii) utilizes a novel Relational Attention mechanism over columns, rows, and primary-foreign key links. Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance, averaging 93% of fully supervised AUROC on binary classification tasks with a single forward pass of a 22M parameter model, as opposed to 84% for a 27B LLM. Fine-tuning yields state-of-the-art results with high sample efficiency. Our experiments show that RT's zero-shot transfer harnesses task-table context, relational attention patterns and schema semantics. Overall, RT provides a practical path toward foundation models for relational data.

Paper Structure

This paper contains 41 sections, 1 equation, 7 figures, 12 tables, 2 algorithms.

Figures (7)

  • Figure 1: (a) The schema specifies tables, columns, foreign keys and primary keys. The task definition is used to construct the task table, which includes labels one aims to predict (e.g., customer churn labels). (b) The context window captures relevant information to predict the label column of the target row, which is masked, excluding rows with later timestamps to prevent temporal leakage. (c) Cells correspond to tokens. Token embedding comprises trainable datatype-specific encoding of cell values and frozen language model (LM) embeddings of table/column names. Relational structure is modeled by our novel Relational Attention layers, where a cell attends to (1) cells in the same column (column attention), (2) cells in the same row and F$\to$P linked rows (feature attention), and (3) P$\to$F linked rows (neighbor attention).
  • Figure 2: RT can be pretrained on data with diverse schemas and task definitions. Pretrained RT is accurate on new datasets and tasks with zero-shot prompting. Dataset- and task-specific fine-tuning of pretrained RT shows high learning efficiency.
  • Figure 3: Test set learning curves up to 32k fine-tuning steps (8M training examples, including repetitions). Curves are averaged over tasks that do not show overfitting. The x-axis is on a log scale. The first point on each curve is the zero-shot performance. Target datasets and tasks are unseen during pretraining. The pretraining data is the same for both RT and Griffin. Pretrained RT is best overall, and untrained RT catches up towards the end.
  • Figure 3: Mean AUROC (%) and R$^2$ (%) on ablating context window components for classification (clf) and regression (reg) tasks. Individual numbers are in the appendix.
  • Figure 4: Pretrained RT shows transfer even without self labels. The setup is the same as for the learning curves in Figure 3.
  • ...and 2 more figures
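
The three attention patterns named in the Figure 1 caption (column, feature, and neighbor attention) can be illustrated as boolean masks over cell tokens. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Cell` class, the `relational_masks` function, the `fk_links` representation (a set of child-row to parent-row pairs, i.e. the F→P direction), and the toy `customers`/`orders` data are all hypothetical names introduced here for illustration. It also assumes that each mask is applied independently, which the caption does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One cell token, identified by its table, row id, and column (hypothetical)."""
    table: str
    row: int      # row id, unique within its table
    column: str


def relational_masks(cells, fk_links):
    """Build three boolean masks; mask[i][j] is True if cell i may attend to cell j.

    fk_links holds (child_row, parent_row) pairs, where each row is a
    (table, row_id) tuple and the child row has a foreign key to the parent row.
    """
    n = len(cells)
    column = [[False] * n for _ in range(n)]
    feature = [[False] * n for _ in range(n)]
    neighbor = [[False] * n for _ in range(n)]
    for i, q in enumerate(cells):
        for j, k in enumerate(cells):
            q_row, k_row = (q.table, q.row), (k.table, k.row)
            # Column attention: cells in the same column of the same table.
            column[i][j] = q.table == k.table and q.column == k.column
            # Feature attention: same row, or the row that q's row points to (F->P).
            feature[i][j] = q_row == k_row or (q_row, k_row) in fk_links
            # Neighbor attention: rows that point to q's row (P->F).
            neighbor[i][j] = (k_row, q_row) in fk_links
    return column, feature, neighbor


# Toy example: one customer row and two order rows that reference it.
cells = [
    Cell("customers", 0, "age"),
    Cell("orders", 0, "amount"),
    Cell("orders", 1, "amount"),
]
fk_links = {
    (("orders", 0), ("customers", 0)),
    (("orders", 1), ("customers", 0)),
}
column, feature, neighbor = relational_masks(cells, fk_links)
```

In this toy example the two `amount` cells see each other through column attention, each order cell sees the customer's `age` cell through feature attention, and the customer cell sees both order cells through neighbor attention. A real implementation would presumably vectorize these masks and also exclude rows with later timestamps, as the caption describes for the context window.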