A Systematic Evaluation Protocol of Graph-Derived Signals for Tabular Machine Learning

Mario Heidrich, Jeffrey Heidemann, Rüdiger Buchkremer, Gonzalo Wandosell Fernández de Bobadilla

Abstract

While graph-derived signals are widely used in tabular learning, existing studies typically rely on limited experimental setups and average performance comparisons, leaving the statistical reliability and robustness of observed gains largely unexplored. Consequently, it remains unclear which signals provide consistent and robust improvements. This paper presents a taxonomy-driven empirical analysis of graph-derived signals for tabular machine learning. We propose a unified and reproducible evaluation protocol to systematically assess which categories of graph-derived signals yield statistically significant and robust performance improvements. The protocol provides an extensible setup for the controlled integration of diverse graph-derived signals into tabular learning pipelines. To ensure a fair and rigorous comparison, it incorporates automated hyperparameter optimization, multi-seed statistical evaluation, formal significance testing, and robustness analysis under graph perturbations. We demonstrate the protocol through an extensive case study on a large-scale, imbalanced cryptocurrency fraud detection dataset. The analysis identifies signal categories providing consistently reliable performance gains and offers interpretable insights into which graph-derived signals indicate fraud-discriminative structural patterns. Furthermore, robustness analyses reveal pronounced differences in how various signals handle missing or corrupted relational data. These findings demonstrate practical utility for fraud detection and illustrate how the proposed taxonomy-driven evaluation protocol can be applied in other application domains.

Paper Structure (49 sections, 4 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Average $F_1$-score improvements across graph signal categories relative to the transaction-only baseline, aggregated over classifiers and random seeds (trimmed aggregation).
  • Figure 2: Per-classifier $F_1$-scores across graph signal categories. Cell values report mean $F_1$-scores with standard deviations across random seeds. Color intensity indicates relative performance differences with respect to the transaction-only (TRX) baseline (green = improvement, red = degradation).
  • Figure 3: Aggregated McNemar test outcomes for graph signal categories across classifiers. Each cell represents the balance of statistically significant improvements ($p \leq 0.05$) versus degradations. Darker green shades indicate a higher frequency of significant performance gains over the transaction-only baseline.
  • Figure 4: Average $\Delta F_1$ relative to the transaction-only baseline across graph signal categories under increasing edge removal.
  • Figure 5: Mean $F_1$-scores across classifiers and graph signals. Cell values report the mean $F_1$-score aggregated over the trimmed middle eight runs. Color intensity indicates relative performance differences with respect to the transaction-only baseline (green = improvement, red = degradation).
  • ...and 7 more figures
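
As a rough illustration of two evaluation steps named in the captions above (trimmed aggregation over random seeds, and the McNemar test at $p \leq 0.05$), the following Python sketch shows one way they could be computed. It is not the authors' implementation: the function names, the equal trimming at both ends of the sorted scores, and the choice of the exact binomial variant of McNemar's test are assumptions made for illustration.

```python
from math import comb


def trimmed_mean(scores, keep):
    """Mean of the middle `keep` values after sorting.

    Assumption: the same number of runs is dropped at the low and high end.
    """
    n = len(scores)
    if keep <= 0 or keep > n or (n - keep) % 2 != 0:
        raise ValueError("keep must be in 1..len(scores) and leave an even number to trim")
    drop = (n - keep) // 2
    middle = sorted(scores)[drop:drop + keep]
    return sum(middle) / keep


def mcnemar_exact(y_true, pred_base, pred_new):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where
      b = samples the baseline classifies correctly and the new model does not,
      c = samples the new model classifies correctly and the baseline does not.
    Assumption: the exact binomial form is used; the source does not say
    which variant of the test was applied.
    """
    b = c = 0
    for t, pb, pn in zip(y_true, pred_base, pred_new):
        if pb == t and pn != t:
            b += 1
        elif pb != t and pn == t:
            c += 1
    n = b + c
    if n == 0:
        # No discordant pairs: the two models are indistinguishable on this data.
        return b, c, 1.0
    # Under H0 each discordant pair favours either model with probability 0.5.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Under these assumptions, a call such as `trimmed_mean(f1_scores, keep=8)` on a hypothetical list of ten per-seed $F_1$-scores would drop the lowest and highest run before averaging, and a graph signal category would count as a significant improvement over the transaction-only baseline when `c > b` and the returned p-value is at most 0.05.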