Learning High-Order Interactions via Targeted Pattern Search

Michela C. Massi, Nicola R. Franco, Francesca Ieva, Andrea Manzoni, Anna Maria Paganoni, Paolo Zunino

TL;DR

The paper tackles the challenge of applying logistic regression (LR) to binary classification when important predictive structure arises from high-order interactions among categorical features in imbalanced, wide datasets. It introduces Learning high-order Interactions via targeted Pattern Search (LIPS), a two-step approach that combines minority-class pattern mining via frequent itemset methods with a novel dissimilarity-based selection step that chooses K diverse interactions for LR; two variants, Scores LIPS and Clusters LIPS, further enhance interpretability. Empirical results on simulated data and a Breast Cancer case study show that LIPS can achieve higher AUC with far fewer interaction terms than state-of-the-art baselines such as glinternet, while maintaining robust performance across sample sizes and imbalance levels. The work offers a scalable, interpretable framework for high-order interaction learning in LR, with potential applications in genomics and healthcare, and suggests directions for incorporating numerical covariates and alternative pattern mining techniques.

Abstract

Logistic Regression (LR) is a widely used statistical method in empirical binary classification studies. However, real-life scenarios often present complexities that prevent the use of the LR model as is, and instead highlight the need to include high-order interactions to capture data variability. This becomes even more challenging because of: (i) datasets growing wider, with more and more variables; (ii) studies being typically conducted in strongly imbalanced settings; (iii) sample sizes ranging from very large to extremely small; (iv) the need to provide both predictive models and interpretable results. In this paper we present a novel algorithm, Learning high-order Interactions via targeted Pattern Search (LIPS), to select interaction terms of varying order to include in an LR model for an imbalanced binary classification task when input data are categorical. LIPS's rationale stems from the duality between item sets and categorical interactions. The algorithm relies on an interaction learning step based on a well-known frequent item set mining algorithm, and a novel dissimilarity-based interaction selection step that allows the user to specify the number of interactions to be included in the LR model. In addition, we define two variants (Scores LIPS and Clusters LIPS) that can address even more specific needs. Through a set of experiments we validate our algorithm and demonstrate its wide applicability to real-life research scenarios, showing that it outperforms a state-of-the-art benchmark algorithm.
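
The duality between item sets and categorical interactions, and the two-step structure described above, can be illustrated with a small sketch. It is not the authors' implementation. It rests on the following assumptions: a brute-force enumeration stands in for the frequent item set mining algorithm (which this section does not name), Jaccard distance between the sets of minority-class rows covered by two patterns stands in for the paper's dissimilarity, and a greedy farthest-first rule stands in for the selection step. All function names and the toy data are hypothetical.

```python
from itertools import combinations

def mine_frequent_itemsets(rows, min_support, max_order):
    """Brute-force stand-in for a frequent item set miner (assumption).
    An item is a (variable, level) pair; an item set is a tuple of items."""
    n = len(rows)
    items = sorted({(col, val) for row in rows for col, val in row.items()})
    frequent = {}
    for order in range(1, max_order + 1):
        for combo in combinations(items, order):
            # A valid interaction uses at most one level per variable.
            if len({col for col, _ in combo}) < order:
                continue
            covered = frozenset(
                i for i, row in enumerate(rows)
                if all(row.get(col) == val for col, val in combo)
            )
            if len(covered) / n >= min_support:
                frequent[combo] = covered
    return frequent

def jaccard_distance(a, b):
    """Dissimilarity between two sets of covered rows (assumed measure)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def select_diverse(frequent, k):
    """Greedy farthest-first choice of up to k patterns (assumed rule)."""
    candidates = sorted(frequent, key=lambda p: (-len(frequent[p]), p))
    selected = candidates[:1]
    while len(selected) < min(k, len(candidates)):
        remaining = [p for p in candidates if p not in selected]
        best = max(
            remaining,
            key=lambda p: min(
                jaccard_distance(frequent[p], frequent[s]) for s in selected
            ),
        )
        selected.append(best)
    return selected

def interaction_column(rows, pattern):
    """Duality: an item set maps to a 0/1 interaction term for LR."""
    return [int(all(row.get(c) == v for c, v in pattern)) for row in rows]

# Toy data: the positive (minority) class occurs when A and B "agree".
data = [
    ({"A": "a1", "B": "b1", "C": "c1"}, 1),
    ({"A": "a1", "B": "b1", "C": "c2"}, 1),
    ({"A": "a2", "B": "b2", "C": "c1"}, 1),
    ({"A": "a2", "B": "b2", "C": "c2"}, 1),
    ({"A": "a1", "B": "b2", "C": "c1"}, 0),
    ({"A": "a1", "B": "b2", "C": "c2"}, 0),
    ({"A": "a2", "B": "b1", "C": "c1"}, 0),
    ({"A": "a2", "B": "b1", "C": "c2"}, 0),
    ({"A": "a1", "B": "b2", "C": "c1"}, 0),
    ({"A": "a2", "B": "b1", "C": "c2"}, 0),
]
all_rows = [row for row, _ in data]
minority_rows = [row for row, label in data if label == 1]

# Step 1: mine patterns on the minority class only.
frequent = mine_frequent_itemsets(minority_rows, min_support=0.5, max_order=2)
# Step 2: keep K mutually dissimilar patterns.
selected = select_diverse(frequent, k=3)
# Each selected pattern becomes one binary column of the LR design matrix.
features = {pattern: interaction_column(all_rows, pattern) for pattern in selected}
```

On this toy data, the order-2 item set `(("A", "a1"), ("B", "b1"))` is frequent in the minority class, and its indicator column is 1 exactly on the first two rows, which is the kind of interaction term an LR model with main effects alone could not represent.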

Paper Structure

This paper contains 25 sections, 18 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Geometric model behind simulated data. On the left (a) are two illustrative observations in the dataset, represented by two tiles from each squared tiling. On the right (b) are the tiles that determine the positive class.
  • Figure 2: $K$ patterns identified by LIPS in one trial of the first simulation experiment (where $K=10$). Tiles are colored according to the areas defined by the categorical terms involved in each of the selected interactions. Red patterns are considered risk patterns ($OR>1$), while protection patterns ($OR<1$) are colored in blue.
  • Figure 3: In order, performance of LIPS in its three variants (red), against DS-LIPS with $supp_{min}=0.1$, DS-LIPS with $supp_{min}=0.5$ and TOP LIPS (blue) on simulated data.
  • Figure 4: Performance of LIPS, Scores LIPS and Clusters LIPS for varying sample sizes.
  • Figure 5: Performance of LIPS, Scores LIPS and Clusters LIPS for varying imbalance ratios. The percentages on the x-axis (reported in log-scale) represent the proportion of minority-class observations in the whole dataset.
  • ...and 1 more figures