Boosting gets full Attention for Relational Learning

Mathieu Guillame-Bert, Richard Nock

TL;DR

This work tackles learning from relational data described by multiple interrelated tables, a setting where traditional tabular approaches struggle to exploit topology. It introduces Relational GBDT, a two-pass gradient-boosted framework that integrates a schema-aware attention mechanism to propagate relational signals across tables, enabling a final tree to be learned on a complete feature set. The method combines a top-down stage of simple table-wise models with a bottom-up attention-based aggregation, and supports non-differentiable weak learners, offering interpretability through feature and relational signal insights. Empirically, Relational GBDT achieves competitive or superior performance across synthetic and real-world relational datasets (synthetic, financial, mutagenesis, arXiv, SST2), often surpassing flattening baselines and traditional tree- or neural-based methods, with notable interpretability advantages for understanding cross-table dependencies.

Abstract

More often than not in benchmark supervised ML, tabular data is flat, i.e. consists of a single $m \times d$ (rows, columns) file, but cases abound in the real world where observations are described by a set of tables with structural relationships. Neural nets-based deep models are a classical fit to incorporate general topological dependence among description features (pixels, words, etc.), but their suboptimality relative to tree-based models on tabular data is still well documented. In this paper, we introduce an attention mechanism for structured data that blends well with tree-based models in the training context of (gradient) boosting. Each aggregated model is a tree whose training involves two steps. First, simple tabular models are learned while descending the tables in a top-down fashion, using boosting's class residuals on each table's features. Second, what has been learned progresses back bottom-up via attention and aggregation mechanisms, progressively crafting new features that, at the end, complete the set of observation features over which a single tree is learned; boosting's iteration clock is then incremented and new class residuals are computed. Experiments on simulated and real-world domains display the competitiveness of our method against a state of the art containing both tree-based and neural nets-based models.

Paper Structure (24 sections, 5 equations, 5 figures, 3 tables, 1 algorithm)

Figures (5)

  • Figure 1: Representation of the dataset schema with an oriented cycle.
  • Figure 2: Various basic patterns that can be represented with the dataset schema described in Section \ref{sec:pb_statement}, and that can be aggregated to create complex schemata.
  • Figure 3: A schedule compatible with the schema shown in Figure \ref{fig:schema}. Nodes are labelled as $n:t$ with $\bar{n} = t$. Arcs are labelled as $e:r$ with $\bar{e} = r$. $n_1$ is the root node, i.e. $n_1 = u$ and $\bar{n}_1 = s = A$. The schedule covers the schema twice: each table is mapped by two nodes, e.g. nodes $n_1$ and $n_5$ map to table $A$.
  • Figure 4: Dataset schemata of the four real-world domains used in the experiments.
  • Figure 5: On the synthetic dataset, average and one-standard-deviation range of the pseudo-labels on tables $A$, $B$, and $C$, on the forward and backward passes of the first training iteration. During the forward pass (a, b and c), $B^{\textrm{prop}}$ contains a single feature for each table. On table $A$, feature $p$ is partially discriminative, showing an upward trend with a large error margin. On table $B$, feature $p'$ is not discriminative, showing no relation with the pseudo-label, which is consistent with the fact that it does not appear in the label definition \ref{eq:synthetic_label}. On table $C$, feature $p''$ is partially discriminative, with a slight downward trend. This shows that $p''$ is discriminative with respect to the residual of the pseudo-label on table $B$, which is equal to the residual of the pseudo-label on table $A$. Therefore, $p''$ can be selected by the hard attention mechanism $B^{\textrm{hard}}$ and forwarded to table $B$ and then $A$. During the backward pass on table $A$, feature $p$ and the hard attention $B^{\textrm{hard}}_A$, which is equal to the selected $p''$, are together highly discriminative, showing very good prediction of $A$'s pseudo-label (d) and a frontier that closely matches the domain's optimal separation; see \ref{eq:synthetic_label}.
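
The forward and backward passes described in the Figure 5 caption can be illustrated with a deliberately simplified toy. This is a sketch under strong assumptions and not the paper's algorithm: it uses only two tables (a parent table $A$ with feature `p` and a child table with feature `q`), a made-up label rule, one boosting iteration with squared loss, and linear least-squares fits standing in for the tree weak learners. All names (`fit_linear`, `attended`, the threshold 1.5) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_parents, n_children = 2000, 3

# Toy schema (assumed): table A has one row per observation with feature p;
# a child table has n_children rows per A row with feature q.
p = rng.uniform(size=n_parents)
q = rng.uniform(size=(n_parents, n_children))
y = (p + q.max(axis=1) > 1.5).astype(float)  # hypothetical label rule


def fit_linear(X, target):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    X1 = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(X1, target, rcond=None)
    return coef


def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


# Boosting residual for squared loss after a constant initial model.
residual = y - y.mean()

# Top-down pass: every child row inherits its parent's residual as pseudo-label,
# and a simple model is fitted on the child table's own feature.
child_pseudo = np.repeat(residual, n_children)
child_X = q.reshape(-1, 1)
child_coef = fit_linear(child_X, child_pseudo)

# Bottom-up pass: score child rows, then keep one row per parent
# (a hard, argmax-style selection) and forward its feature to table A.
scores = predict_linear(child_coef, child_X).reshape(n_parents, n_children)
best = scores.argmax(axis=1)
attended = q[np.arange(n_parents), best]

# Final weak learner on A: parent feature alone vs. parent + forwarded feature.
base_X = p.reshape(-1, 1)
aug_X = np.column_stack([p, attended])
base_coef = fit_linear(base_X, residual)
aug_coef = fit_linear(aug_X, residual)
mse_base = np.mean((residual - predict_linear(base_coef, base_X)) ** 2)
mse_aug = np.mean((residual - predict_linear(aug_coef, aug_X)) ** 2)
print(f"parent-only MSE: {mse_base:.4f}, with forwarded feature: {mse_aug:.4f}")
```

Because the augmented fit nests the parent-only fit, its in-sample error cannot be larger; the point of the toy is only to show how a feature that lives on a child table can reach the observation table through the residuals, not to reproduce any result from the paper.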