Detection of Interacting Variables for Generalized Linear Models via Neural Networks

Yevhen Havrylenko, Julia Heger

TL;DR

This work automates the discovery of variable interactions that improve GLMs for insurance claim frequencies. It proceeds in three steps: a Combined Actuarial Neural Network (CANN) boosts a fixed benchmark GLM; a model-specific Neural Interaction Detection (NID) ranks the interactions the network has learned; and mini-GLMs identify the next-best interaction to add. The method is validated on an artificially generated data set with known true interactions and on the open freMTPL2freq data set, where it shows competitive performance and is substantially faster than traditional H-statistic approaches. It also handles high-cardinality categorical features via embedding layers and outlines a workflow suited to large-scale MTPL data, where timely iterative refinement of GLMs is valuable for actuaries. Overall, the approach offers a data-driven, interpretable, and scalable way to enhance GLMs progressively without overhauling existing pricing structures.

Abstract

The quality of generalized linear models (GLMs), frequently used by insurance companies, depends on the choice of interacting variables. The search for interactions is time-consuming, especially for data sets with a large number of variables, depends much on expert judgement of actuaries, and often relies on visual performance indicators. Therefore, we present an approach to automating the process of finding interactions that should be added to GLMs to improve their predictive power. Our approach relies on neural networks and a model-specific interaction detection method, which is computationally faster than the traditionally used methods like Friedman H-Statistic or SHAP values. In numerical studies, we provide the results of our approach on artificially generated data as well as open-source data.
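
To make the idea of a neural network that "boosts a fixed benchmark GLM" concrete, the following is a minimal NumPy sketch. It assumes the common CANN form with a log link, in which the network output is added to the frozen GLM linear predictor before exponentiation; this summary does not spell out the formula, so that form is an assumption here. All function and variable names are hypothetical. With the network's output weights set to zero, the sketch reproduces the GLM prediction exactly, which is the natural starting point for training.

```python
import numpy as np

def glm_frequency(X, exposure, beta0, beta):
    """Poisson GLM with log link: expected claim count = exposure * exp(score)."""
    return exposure * np.exp(beta0 + X @ beta)

def nn_adjustment(X, W1, b1, w2, b2):
    """One-hidden-layer network returning an additive correction on the log scale."""
    hidden = np.tanh(X @ W1 + b1)
    return hidden @ w2 + b2

def cann_frequency(X, exposure, beta0, beta, nn_params):
    """Assumed CANN form: GLM score (kept fixed) plus the network correction."""
    glm_score = beta0 + X @ beta
    return exposure * np.exp(glm_score + nn_adjustment(X, *nn_params))

# Illustrative data only; nothing here comes from the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
exposure = rng.uniform(0.1, 1.0, size=5)
beta0, beta = -2.0, np.array([0.3, -0.1, 0.2])

# Zero output weights and bias: the network contributes nothing yet.
nn_params = (rng.normal(size=(3, 4)), np.zeros(4), np.zeros(4), 0.0)

mu_glm = glm_frequency(X, exposure, beta0, beta)
mu_cann = cann_frequency(X, exposure, beta0, beta, nn_params)
```

Because the GLM coefficients stay fixed in this sketch, anything the network learns during training is structure the benchmark GLM misses, which is what a subsequent interaction-detection step would then inspect.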

Paper Structure (19 sections, 14 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Architecture of a CANN model
  • Figure 2: Example of the NN part of a CANN model that uses a $2$-dimensional embedding layer (in light blue) encoding a categorical feature $\tilde{x}_{\cdot, 7}$.
  • Figure 3: Lift plots.
  • Figure 4: Generation of interactions in the first hidden layer and propagation of these interactions through the network. Figure adapted from Tsang et al. (2017).
  • Figure 5: Illustration of the interaction-strength calculation: evaluate all neurons of the first hidden layer by measuring the in-going and outgoing paths and then aggregate the results.
  • ...and 1 more figures
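
The interaction-strength calculation described in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal NumPy illustration of the basic NID idea from Tsang et al. (2017), not the paper's exact model-specific variant: for each first-hidden-layer neuron, the incoming weights of two features are combined with a minimum, multiplied by the neuron's aggregated outgoing influence (the product of absolute weight matrices of the later layers), and summed over neurons. The function name, the weight layout (rows index outputs, columns index inputs), and the toy weights are assumptions for illustration.

```python
from itertools import combinations

import numpy as np

def nid_pairwise_strengths(hidden_weights, w_out):
    """Rank feature pairs by NID-style interaction strength.

    hidden_weights: list [W1, ..., WL], each of shape (n_out, n_in).
    w_out: output-layer weight vector whose length equals n_out of WL.
    """
    W1 = np.abs(hidden_weights[0])
    # Outgoing influence of each first-layer neuron: |w_out|^T |WL| ... |W2|.
    z = np.abs(np.asarray(w_out, dtype=float))
    for W in reversed(hidden_weights[1:]):
        z = z @ np.abs(W)
    strengths = {}
    for j, k in combinations(range(W1.shape[1]), 2):
        # Incoming strength: the weaker of the two feature weights per neuron.
        incoming = np.minimum(W1[:, j], W1[:, k])
        strengths[(j, k)] = float(np.sum(z * incoming))
    # Strongest interaction first.
    return sorted(strengths.items(), key=lambda kv: kv[1], reverse=True)

# Toy network: 3 features, 2 first-layer neurons, 1 second-layer neuron.
W1 = np.array([[1.0, 1.0, 0.0],
               [0.0, 0.5, 2.0]])
W2 = np.array([[2.0, -3.0]])
w_out = np.array([1.0])

ranking = nid_pairwise_strengths([W1, W2], w_out)
```

In this toy example, features 0 and 1 both feed the first neuron strongly, so their pair ranks first; features 0 and 2 never share a neuron, so their strength is zero. Only the trained weights are used, which is why no repeated model evaluations are required.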