Inductive inference of gradient-boosted decision trees on graphs for insurance fraud detection

Félix Vandervorst, Bruno Deprez, Wouter Verbeke, Tim Verdonck

TL;DR

This work tackles insurance fraud detection on heterogeneous and dynamic graphs by introducing G-GBM, an inductive gradient-boosted tree framework that leverages probability-weighted metapaths over ego-nets. By combining heterogeneous information networks with path-based feature representations and weighted splits, G-GBM achieves competitive or superior performance compared with GraphSAGE and HinSAGE across simulated and real-world datasets, while enabling explainability via SHAP analyses. The approach preserves the strengths of tree-based methods (handling of categorical features, missing values, and interpretability) and extends them to graph contexts without iterative neighborhood aggregation. Empirically, G-GBM demonstrates Pareto-dominant performance and practical utility for fraud detection in evolving networks, with open-source code provided to support reproducibility and extension.

Abstract

Graph-based methods are becoming increasingly popular in machine learning due to their ability to model complex data and relations. Insurance fraud is a prime use case, since false claims are often the work of organised criminals who stage accidents, or of the same persons filing erroneous claims on multiple policies. One challenge is that graph-based approaches struggle to find meaningful representations of the data because of the high class imbalance present in fraud data. Another is that insurance networks are heterogeneous and dynamic, given the changing relations among people, companies and policies. For these reasons, gradient-boosted tree approaches on tabular data still dominate the field. We therefore present a novel inductive graph gradient boosting machine (G-GBM) for supervised learning on heterogeneous and dynamic graphs. We show that our estimator competes with popular graph neural network approaches in an experiment using a variety of simulated random graphs. We demonstrate the power of G-GBM for insurance fraud detection using an open-source dataset and a real-world, proprietary dataset. Because the backbone model is a gradient boosting forest, we apply established explainability methods to gain better insights into the predictions made by G-GBM.

Paper Structure

This paper contains 22 sections, 15 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: An ego-net $G^n_v$ of size $n=2$ centered on node $v$ with two node types: companies (orange) and administrators (green). The simple path set $P(v)^n$ (edges are omitted for compactness) is: $\{(v_0, v_1, v_6), (v_0, v_2, v_7), (v_0, v_3), (v_0, v_4, v_8), (v_0, v_4, v_9), (v_0, v_5, v_9)\}$.
  • Figure 2: Different random graph model examples with 50 nodes. The features $x_i$ presented in the middle column are univariate and i.i.d. (independently and identically distributed), following a distribution $\mathcal{N}(0,1)$. The colors (red, orange, yellow) indicate quantiles of the distribution: $5\%$, $10\%$, and $20\%$, respectively, to represent "abnormal" values. The right-hand column presents the labels $y_i$, which are the top $10\%$ nodes whose 2-hop neighborhood average likelihood is the lowest (hence, $y_i$ are not independent).
  • Figure 3: Performance measurements for the test set of company node labels.
  • Figure 4: Variable importance in G-GBM: H is the head node, $N_q$ refers to the neighborhood of level $q$ relative to the head node, and (c) and (a) refer to the node type of the company and administrator, respectively.
  • Figure 5: Performance measurement for the test set of administrator nodes (fraud vs. legitimate)
  • ...and 3 more figures
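
The simple path set in the Figure 1 caption can be reproduced with a short sketch. This is an illustration only, not the authors' implementation. The edge list is an assumption inferred from the six listed paths (the figure itself is not available here). Reading $P(v)^n$ as the set of maximal simple paths from the head node (paths that reach length $n$ or cannot be extended further) is also an assumption, chosen because it matches the caption. The names `EDGES`, `build_adjacency` and `maximal_simple_paths` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed edges, inferred from the paths in the Figure 1 caption.
# Node i stands for v_i; node 0 is the head node v_0.
EDGES: List[Tuple[int, int]] = [
    (0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
    (1, 6), (2, 7), (4, 8), (4, 9), (5, 9),
]


def build_adjacency(edges: List[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Build an undirected adjacency map from an edge list."""
    adj: Dict[int, Set[int]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_simple_paths(
    adj: Dict[int, Set[int]], head: int, n: int
) -> List[Tuple[int, ...]]:
    """Enumerate simple paths from `head` with at most `n` edges,
    keeping only those that cannot be extended within the limit."""
    paths: List[Tuple[int, ...]] = []

    def extend(path: List[int]) -> None:
        # Candidate next nodes: neighbours not already on the path.
        candidates = [w for w in sorted(adj.get(path[-1], ())) if w not in path]
        if len(path) - 1 == n or not candidates:
            if len(path) > 1:  # skip the trivial path (head only)
                paths.append(tuple(path))
            return
        for w in candidates:
            extend(path + [w])

    extend([head])
    return paths


ADJ = build_adjacency(EDGES)
PATHS = maximal_simple_paths(ADJ, head=0, n=2)
for p in PATHS:
    print(" -> ".join(f"v{i}" for i in p))
```

Under these assumptions the sketch returns the six paths from the caption, including the short path $(v_0, v_3)$, which ends early because $v_3$ has no neighbour other than the head node.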