Relatron: Automating Relational Machine Learning over Relational Databases

Zhikai Chen, Han Xie, Jian Zhang, Jiliang Tang, Xiang Song, Huzefa Rangwala

TL;DR

This study presents Relatron, a task embedding-based meta-selector that chooses between RDL and DFS and prunes the within-family search.

Abstract

Predictive modeling over relational databases (RDBs) powers applications, yet remains challenging because it requires capturing both cross-table dependencies and complex feature interactions. Relational Deep Learning (RDL) methods automate feature engineering via message passing, while classical approaches like Deep Feature Synthesis (DFS) rely on predefined non-parametric aggregators. Despite the performance gains reported for RDL, its comparative advantages over DFS and the design principles for selecting effective architectures remain poorly understood. We present a comprehensive study that unifies RDL and DFS in a shared design space and conducts architecture-centric searches across diverse RDB tasks. Our analysis yields three key findings: (1) RDL does not consistently outperform DFS, with performance being highly task-dependent; (2) no single architecture dominates across tasks, underscoring the need for task-aware model selection; and (3) validation accuracy is an unreliable guide for architecture choice. This search yields a model performance bank that links architecture configurations to their performance; leveraging this bank, we analyze the drivers of the RDL-DFS performance gap and introduce two task signals -- RDB task homophily and an affinity embedding that captures size, path, feature, and temporal structure -- whose correlation with the gap enables principled routing. Guided by these signals, we propose Relatron, a task embedding-based meta-selector that chooses between RDL and DFS and prunes the within-family search. Lightweight loss-landscape metrics further guard against brittle checkpoints by preferring flatter optima. In experiments, Relatron resolves the "more tuning, worse performance" effect and, in joint hyperparameter-architecture optimization, achieves up to 18.5% improvement over strong baselines at 10x lower cost than Fisher information-based alternatives.

Paper Structure (52 sections, 15 theorems, 53 equations, 7 figures, 12 tables)

This paper contains 52 sections, 15 theorems, 53 equations, 7 figures, 12 tables.

Key Result

Lemma 1

For all $s\in\mathbb R$ and $m\in\mathcal{M}$,

Figures (7)

  • Figure 1: (a) An example of generating the task table from an RDB. The label is based on whether a student has achieved an A+ in a course before a specific timestamp. (b) Another example demonstrating the working process of DFS and RDL. For DFS, a predefined set of aggregation functions, such as MEAN and COUNT, is used to aggregate information across multiple tables based on key relationships into a final data table. For comparison, RDL is claimed to replace the manual aggregation design with an automatic message-passing-based sparse attention.
  • Figure 2: Performance comparison between the best configurations from our design space and baseline models on entity-level tasks. "Best (ours)" means the better value of RDL and DFS. Full numerical results are given in Table \ref{tab: full-figure31} in Appendix \ref{app: sup-exp}.
  • Figure 3: Augmenting rel-f1 databases. We should treat the set of FKs as a hyperedge (each pair of FKs is appended to the original PK-FK graphs as a new edge type) rather than relying solely on PK-FK edges.
  • Figure 4: Ground truth GraphGym similarity
  • Figure 5: HPO results for recommendation tasks.
  • ...and 2 more figures
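
The DFS side of Figure 1(b) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the toy tables, the column names (`student_id`, `score`, `ts`), and the helper `dfs_features` are hypothetical and chosen only for illustration. It shows predefined aggregators (MEAN and COUNT) collapsing child rows onto a parent table through a key relationship, using only rows before a cutoff timestamp.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy RDB: a parent "students" table and a child "grades" table
# linked through the foreign key student_id (all names are illustrative).
students = [{"student_id": 1}, {"student_id": 2}]
grades = [
    {"student_id": 1, "score": 90.0, "ts": 1},
    {"student_id": 1, "score": 70.0, "ts": 2},
    {"student_id": 2, "score": 80.0, "ts": 1},
    {"student_id": 1, "score": 100.0, "ts": 5},  # at/after the cutoff: ignored
]


def dfs_features(parents, children, key, value, cutoff):
    """Aggregate child rows with ts < cutoff into COUNT/MEAN features per parent."""
    groups = defaultdict(list)
    for row in children:
        if row["ts"] < cutoff:  # temporal filter: keep only rows before the cutoff
            groups[row[key]].append(row[value])

    features = []
    for parent in parents:
        values = groups.get(parent[key], [])
        features.append({
            key: parent[key],
            f"COUNT({value})": len(values),
            f"MEAN({value})": mean(values) if values else None,
        })
    return features


features = dfs_features(students, grades, "student_id", "score", cutoff=3)
for row in features:
    print(row)
```

The aggregators here are fixed in advance and have no learned parameters, which is the contrast the caption draws with RDL's message passing.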

Theorems & Definitions (33)

  • Definition 1: RDB task homophily
  • Definition 2: Class-insensitive homophily
  • Definition 3: Aggregation homophily
  • Definition 4: Metapath-wise contextual SBM (tCSBM)
  • Remark 1: Metapath-induced graphs as the substrate for DFS and RDL
  • Lemma 1: Gate-off, flip, and linear region
  • proof
  • Remark 2: Interpretation of Lemma \ref{lem:properties}
  • Proposition 1: Vector form on the metapath-projected $\mathsf F$--$\mathsf F$ graph.
  • proof
  • ...and 23 more