Boosting gets full Attention for Relational Learning
Mathieu Guillame-Bert, Richard Nock
TL;DR
This work tackles learning from relational data described by multiple interrelated tables, a setting where traditional tabular approaches struggle to exploit the topology of the data. It introduces Relational GBDT, a two-pass gradient-boosted framework that uses a schema-aware attention mechanism to propagate relational signals across tables, so that each boosting iteration learns a single tree on a completed feature set. The method combines a top-down stage of simple table-wise models with a bottom-up stage of attention-based aggregation. It supports non-differentiable weak learners and offers interpretability through insights into features and relational signals. Empirically, Relational GBDT achieves competitive or superior performance across synthetic and real-world relational datasets (including financial, mutagenesis, arXiv, and SST2), often surpassing flattening baselines as well as traditional tree-based and neural methods, with notable interpretability advantages for understanding cross-table dependencies.
Abstract
More often than not in benchmark supervised ML, tabular data is flat, i.e. it consists of a single $m \times d$ (rows $\times$ columns) file, but cases abound in the real world where observations are described by a set of tables with structural relationships. Deep neural models are a classical fit for incorporating general topological dependence among description features (pixels, words, etc.), but their inferiority to tree-based models on tabular data remains well documented. In this paper, we introduce an attention mechanism for structured data that blends well with tree-based models in the training context of (gradient) boosting. Each aggregated model is a tree whose training involves two steps. First, simple tabular models are learned by descending the tables in a top-down fashion, fitting boosting's class residuals on each table's features. Second, what has been learned propagates back bottom-up via attention and aggregation mechanisms, progressively crafting new features that eventually complete the set of observation features over which a single tree is learned. Boosting's iteration clock is then incremented and new class residuals are computed. Experiments on simulated and real-world domains show that our method is competitive with a state of the art comprising both tree-based and neural models.
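
To make the two-step loop above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, and it rests on several simplifying assumptions: squared-loss residuals instead of class residuals, exactly one parent (observation) table and one child table linked by a foreign key, depth-1 trees as weak learners, and a softmax over the magnitude of child scores standing in for the attention mechanism. The names `fit_stump`, `attention_pool`, and `relational_boost` are hypothetical and chosen only for illustration.

```python
import math


def fit_stump(X, y):
    """Fit a depth-1 least-squares regression tree; return a predict function."""
    n = len(X)
    mean = sum(y) / n
    best_sse = sum((v - mean) ** 2 for v in y)
    best_split = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if sse < best_sse:
                best_sse, best_split = sse, (j, t, ml, mr)
    if best_split is None:
        return lambda row: mean  # no useful split: predict the mean
    j, t, ml, mr = best_split
    return lambda row: ml if row[j] <= t else mr


def attention_pool(scores):
    """Softmax-weighted average of child scores (assumed attention form)."""
    if not scores:
        return 0.0  # observation without child rows
    top = max(abs(s) for s in scores)
    weights = [math.exp(abs(s) - top) for s in scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def relational_boost(parent_X, child_X, child_fk, y, rounds=5, lr=0.5):
    """Two-pass boosting on a parent table and one child table.

    child_fk[k] is the index of the parent row that child row k belongs to.
    """
    n = len(parent_X)
    pred = [0.0] * n
    for _ in range(rounds):
        # Squared-loss residuals (an assumption of this sketch).
        resid = [y[i] - pred[i] for i in range(n)]
        # Top-down: each child row inherits its parent's residual as target.
        child_model = fit_stump(child_X, [resid[p] for p in child_fk])
        # Bottom-up: pool child-level scores into one new parent feature.
        scores = [[] for _ in range(n)]
        for row, p in zip(child_X, child_fk):
            scores[p].append(child_model(row))
        full_X = [list(parent_X[i]) + [attention_pool(scores[i])] for i in range(n)]
        # Single tree on the completed feature set, then update predictions.
        tree = fit_stump(full_X, resid)
        pred = [pred[i] + lr * tree(full_X[i]) for i in range(n)]
    return pred


# Toy data: the label depends only on the child table.
parent_X = [[0.0], [0.0], [0.0], [0.0]]
child_X = [[0.1], [0.2], [0.9], [0.1], [0.3], [0.2], [0.8], [0.7]]
child_fk = [0, 0, 1, 1, 2, 2, 3, 3]
y = [0.0, 1.0, 0.0, 1.0]
pred = relational_boost(parent_X, child_X, child_fk, y)
```

In this toy example the parent table carries no signal, so a flat model trained on `parent_X` alone could not separate the labels. The pooled child feature is what lets the final tree in each round fit the residuals.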
