Harnessing Mixed Features for Imbalance Data Oversampling: Application to Bank Customers Scoring

Abdoulaye Sakho, Emmanuel Malherbe, Carl-Erik Gauthier, Erwan Scornet

TL;DR

The paper tackles class imbalance in binary classification on tabular data with mixed features, a common challenge in banking tasks where the label is $Y\in\{0,1\}$ and the features include both continuous and categorical variables. It introduces MGS-GRF, an oversampling method that first synthesizes continuous features with a multivariate Gaussian KDE (MGS) and then generates categorical features with a Generalized Random Forest (GRF). The authors formalize two quality notions: coherence (generated categories are already observed in the minority class) and association (the dependence between continuous and categorical features is preserved). Both notions correlate with predictive performance across simulations and real banking datasets, with MGS-GRF often delivering the best results. The method is robust to high-dimensional noise and applicable to a private banking use case, where it achieves competitive or superior performance within a regulatory-compliant pipeline. Overall, MGS-GRF provides a principled, model-agnostic approach to realistic mixed-feature oversampling that improves downstream risk-scoring tasks.
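To make the coherence notion concrete, the sketch below computes the share of synthetic rows whose categorical combination already occurs among the original minority rows. This is only an illustrative reading of the idea: the function name `coherence_rate` and the toy categories are assumptions, and the paper's formal $Coh$ definition may differ.

```python
from typing import Hashable, Iterable, Sequence


def coherence_rate(
    original_cats: Iterable[Sequence[Hashable]],
    synthetic_cats: Iterable[Sequence[Hashable]],
) -> float:
    """Share of synthetic rows whose categorical combination is observed
    in the original minority rows (hypothetical helper, not the paper's Coh)."""
    observed = {tuple(row) for row in original_cats}
    synthetic = [tuple(row) for row in synthetic_cats]
    if not synthetic:
        raise ValueError("synthetic_cats must contain at least one row")
    hits = sum(1 for row in synthetic if row in observed)
    return hits / len(synthetic)


# Toy minority samples: (employment status, housing status), made up for illustration.
minority = [("student", "rent"), ("retired", "own"), ("employed", "own")]
# The second synthetic row combines two valid categories that never co-occur above.
synthetic = [
    ("student", "rent"),
    ("student", "own"),
    ("retired", "own"),
    ("employed", "own"),
]

rate = coherence_rate(minority, synthetic)
print(rate)  # 0.75
```

A sampler that draws whole categorical rows from the original minority samples scores 1.0 under this measure by construction, whereas one that picks each categorical feature independently can produce unseen combinations.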

Abstract

This study investigates rare event detection on tabular data within binary classification. Standard techniques to handle class imbalance include SMOTE, which generates synthetic samples from the minority class. However, SMOTE is intrinsically designed for continuous input variables. Apart from SMOTE-NC, its default extension to mixed features (continuous and categorical variables), very few works propose procedures to synthesize mixed features. Yet many real-world classification tasks, such as those in the banking sector, involve mixed features, which have a significant impact on predictive performance. To this end, we introduce MGS-GRF, an oversampling strategy designed for mixed features. This method uses a kernel density estimator with locally estimated full-rank covariances to generate continuous features, while categorical ones are drawn from the original samples through a generalized random forest. Empirically, we show that MGS-GRF, unlike SMOTE-NC, exhibits two important properties: (i) coherence, i.e., the ability to generate only combinations of categorical features that are already present in the original data set, and (ii) association, i.e., the ability to preserve the dependence between continuous and categorical features. We also evaluate the predictive performance of LightGBM classifiers trained on data sets augmented with synthetic samples from various strategies. Our comparison is performed on simulated and public real-world data sets, as well as on a private data set from a leading financial institution. We observe that synthetic procedures with the coherence and association properties achieve better predictive performance on several metrics (including PR AUC and ROC AUC), with MGS-GRF performing best. Furthermore, our method shows promising results for the private banking application, with a development pipeline that complies with regulatory constraints.
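As a rough illustration of the continuous step, the sketch below draws each synthetic point from a Gaussian whose mean and covariance are estimated on the neighborhood of a randomly chosen minority sample. It is written under stated assumptions, not as the authors' implementation: the function name `sample_local_gaussian`, the neighborhood size `k`, centering on the neighborhood mean, and the small `ridge` term used to keep the covariance full rank are all hypothetical choices. The categorical step with a generalized random forest is not covered.

```python
import numpy as np


def sample_local_gaussian(X_min, n_samples, k=5, ridge=1e-6, seed=0):
    """Draw synthetic continuous points from locally fitted Gaussians.

    Hypothetical sketch: for each draw, pick a minority sample, take its k
    nearest minority neighbors (Euclidean), and sample from the Gaussian with
    that neighborhood's empirical mean and covariance.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n, d = X_min.shape
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)
    out = np.empty((n_samples, d))
    for i in range(n_samples):
        center = X_min[rng.integers(n)]
        dist = np.linalg.norm(X_min - center, axis=1)
        # The chosen sample plus its k nearest neighbors.
        neigh = X_min[np.argsort(dist)[: k + 1]]
        mu = neigh.mean(axis=0)
        # Assumption: a ridge term keeps the local covariance full rank.
        cov = np.atleast_2d(np.cov(neigh, rowvar=False)) + ridge * np.eye(d)
        out[i] = rng.multivariate_normal(mu, cov)
    return out


# Toy minority data: 30 points in 2 dimensions, made up for illustration.
X_minority = np.random.default_rng(42).normal(size=(30, 2))
X_new = sample_local_gaussian(X_minority, n_samples=10)
print(X_new.shape)  # (10, 2)
```

Using a full local covariance rather than interpolating along a segment between two neighbors, as SMOTE does, lets synthetic points spread in every direction spanned by the neighborhood.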

Paper Structure

This paper contains 26 sections, 12 equations, 3 figures, 5 tables, 2 algorithms.

Figures (3)

  • Figure 1: PR AUC in the coherence simulations. Point colors reflect their $Coh$ value.
  • Figure 2: Association experiments in high dimensional setting with noisy features.
  • Figure 3: Pipeline for the private data set (described in the section labeled `sec:exp-data-real`).

Theorems & Definitions (2)

  • Definition
  • Definition