Relatron: Automating Relational Machine Learning over Relational Databases

Zhikai Chen; Han Xie; Jian Zhang; Jiliang Tang; Xiang Song; Huzefa Rangwala

TL;DR

This study presents Relatron, a task embedding-based meta-selector that chooses between RDL and DFS and prunes the within-family search.

Abstract

Predictive modeling over relational databases (RDBs) powers applications, yet remains challenging because models must capture both cross-table dependencies and complex feature interactions. Relational Deep Learning (RDL) methods automate feature engineering via message passing, while classical approaches like Deep Feature Synthesis (DFS) rely on predefined non-parametric aggregators. Despite performance gains, the comparative advantages of RDL over DFS and the design principles for selecting effective architectures remain poorly understood. We present a comprehensive study that unifies RDL and DFS in a shared design space and conducts architecture-centric searches across diverse RDB tasks. Our analysis yields three key findings: (1) RDL does not consistently outperform DFS, with performance being highly task-dependent; (2) no single architecture dominates across tasks, underscoring the need for task-aware model selection; and (3) validation accuracy is an unreliable guide for architecture choice. This search yields a model performance bank that links architecture configurations to their performance; leveraging this bank, we analyze the drivers of the RDL-DFS performance gap and introduce two task signals -- RDB task homophily and an affinity embedding that captures size, path, feature, and temporal structure -- whose correlation with the gap enables principled routing. Guided by these signals, we propose Relatron, a task embedding-based meta-selector that chooses between RDL and DFS and prunes the within-family search. Lightweight loss-landscape metrics further guard against brittle checkpoints by preferring flatter optima. In experiments, Relatron resolves the "more tuning, worse performance" effect and, in joint hyperparameter-architecture optimization, achieves up to 18.5% improvement over strong baselines with 10x lower cost than Fisher information-based alternatives.

Paper Structure (52 sections, 15 theorems, 53 equations, 7 figures, 12 tables)

Introduction
Related Work and Background
Design space of model architectures over RDBs
Predictive Tasks on RDBs
Model architecture design space
Empirical study of various architecture designs
Principles and automation of architecture selection
From observations to task embeddings
Data-centric perspective
Model-centric perspective
Automatic architecture selection through Relatron
Experimental evaluations
Conclusion, Limitations, and Future Discussion
Ethics Statement
Reproducibility Statement
...and 37 more sections

Key Result

Lemma 1

For all $s\in\mathbb R$ and $m\in\mathcal{M}$,

Figures (7)

Figure 1: (a) An example of generating the task table from an RDB. The label is based on whether a student has achieved an A+ in a course before a specific timestamp. (b) Another example demonstrating the working process of DFS and RDL. For DFS, a predefined set of aggregation functions, such as MEAN and COUNT, is used to aggregate information across multiple tables based on key relationships into a final data table. For comparison, RDL is claimed to replace the manual aggregation design with an automatic message-passing-based sparse attention.
Figure 2: Performance comparison between the best configurations from our design space and baseline models on entity-level tasks. "Best (ours)" means the better value of RDL and DFS. Full numerical results can be seen in Table \ref{tab: full-figure31} in Appendix \ref{app: sup-exp}.
Figure 3: Augmenting rel-f1 databases. We should treat the set of FKs as a hyperedge (each pair of FKs is appended to the original PK-FK graphs as a new edge type) rather than relying solely on PK-FK edges.
Figure 4: Ground truth GraphGym similarity
Figure 5: HPO results for recommendation tasks.
...and 2 more figures
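
The DFS process described in the caption of Figure 1 can be illustrated with a minimal sketch. It assumes a hypothetical parent table `students` and child table `enrollments` linked by `student_id`; the table names, columns, values, the `dfs_features` helper, and the cutoff are all illustrative assumptions, not the paper's implementation. It applies two predefined aggregators, COUNT and MEAN, to child rows recorded before a timestamp cutoff, and produces one flat feature row per parent entity.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables; names and values are illustrative only.
students = [{"student_id": 1}, {"student_id": 2}, {"student_id": 3}]
enrollments = [
    {"student_id": 1, "score": 90.0, "ts": 1},
    {"student_id": 1, "score": 70.0, "ts": 2},
    {"student_id": 1, "score": 50.0, "ts": 9},  # at/after the cutoff, ignored
    {"student_id": 2, "score": 80.0, "ts": 3},
]


def dfs_features(parents, children, key, value, cutoff):
    """Aggregate child rows into one feature row per parent (COUNT, MEAN).

    Only child rows with ts < cutoff are used, so that no information from
    after the prediction timestamp leaks into the features.
    """
    groups = defaultdict(list)
    for row in children:
        if row["ts"] < cutoff:
            groups[row[key]].append(row[value])

    features = []
    for parent in parents:
        vals = groups.get(parent[key], [])
        features.append({
            key: parent[key],
            f"COUNT({value})": len(vals),
            # MEAN is undefined for parents with no child rows.
            f"MEAN({value})": mean(vals) if vals else None,
        })
    return features


features = dfs_features(students, enrollments, "student_id", "score", cutoff=5)
for row in features:
    print(row)
```

The resulting flat table could be passed to any tabular learner. By contrast, RDL would learn the aggregation through message passing instead of fixing COUNT and MEAN in advance.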

Theorems & Definitions (33)

Definition 1: RDB task homophily
Definition 2: Class-insensitive homophily
Definition 3: Aggregation homophily
Definition 4: Metapath-wise contextual SBM (tCSBM)
Remark 1: Metapath-induced graphs as the substrate for DFS and RDL
Lemma 1: Gate-off, flip, and linear region
proof
Remark 2: Interpretation of Lemma \ref{lem:properties}
Proposition 1: Vector form on the metapath-projected $\mathsf F$--$\mathsf F$ graph.
proof
...and 23 more
