A Copula Based Supervised Filter for Feature Selection in Diabetes Risk Prediction Using Machine Learning

Agnideep Aich, Md Monzur Murshed, Sameera Hewage, Amanda Mayeaux

TL;DR

The paper introduces a supervised filter for feature selection based on the Gumbel copula upper-tail dependence coefficient $\lambda_U$, which identifies predictors whose extreme values co-occur with the positive (diabetes) class. Variables are converted to pseudo-observations, Kendall's $\tau$ is mapped to the Gumbel parameter $\theta$ and then to $\lambda_U$, and features are ranked by $\lambda_U$; bootstrap confidence intervals quantify the uncertainty of each tail signal. Evaluations on the large CDC health indicators dataset (N=253,680) and the clinical PIMA dataset (N=768) show that the Gumbel-$\lambda_U$ selector is fast, reduces dimensionality (CDC by ~52%), and yields competitive or superior ROC-AUC compared with MI, mRMR, and ReliefF. On PIMA, where the method is used for ranking only, no AUC differences are significant. The approach yields clinically coherent top predictors and demonstrates robustness to noise and missing data, offering a practical, interpretable tool for public health screening and clinical risk modeling. The work also outlines extensions to other tail-copula families, potential applications across omics and imaging domains, and considerations for deployment, calibration, and fairness in real-world settings.
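
As a worked illustration of the $\tau \to \theta \to \lambda_U$ mapping, the standard Gumbel copula identities can be written out as follows. The value $\tau = 0.5$ is a hypothetical input chosen for the arithmetic, not a result from the paper, and restricting to $\tau \ge 0$ reflects the Gumbel family's constraint $\theta \ge 1$.

$$
\tau = 1 - \frac{1}{\theta}
\;\Longrightarrow\;
\theta = \frac{1}{1-\tau},
\qquad
\lambda_U = 2 - 2^{1/\theta} = 2 - 2^{\,1-\tau}.
$$

$$
\tau = 0.5
\;\Longrightarrow\;
\theta = 2,
\qquad
\lambda_U = 2 - \sqrt{2} \approx 0.586.
$$

Independence ($\tau = 0$) gives $\theta = 1$ and $\lambda_U = 0$, while $\tau \to 1$ gives $\lambda_U \to 1$, so the score is a monotone transform of $\tau$ on $[0, 1)$.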

Abstract

Effective feature selection is vital for robust and interpretable medical prediction, especially for identifying risk factors concentrated in extreme patient strata. Standard methods emphasize average associations and may miss predictors whose importance lies in the tails of the distribution. We propose a computationally efficient supervised filter that ranks features using the Gumbel copula upper tail dependence coefficient ($\lambda_U$), prioritizing variables that are simultaneously extreme with the positive class. We benchmarked against Mutual Information, mRMR, ReliefF, and $L_1$ Elastic Net across four classifiers on two diabetes datasets: a large public health survey (CDC, N=253,680) and a clinical benchmark (PIMA, N=768). Evaluation included paired statistical tests, permutation importance, and robustness checks with label flips, feature noise, and missingness. On CDC, our method was the fastest selector and reduced the feature space by about 52% while retaining strong discrimination. Although using all 21 features yielded the highest AUC, our filter significantly outperformed Mutual Information and mRMR and was statistically indistinguishable from ReliefF. On PIMA, with only eight predictors, our ranking produced the numerically highest ROC AUC, and no significant differences were found versus strong baselines. Across both datasets, the upper tail criterion consistently identified clinically coherent, impactful predictors. We conclude that copula based feature selection via upper tail dependence is a powerful, efficient, and interpretable approach for building risk models in public health and clinical medicine.
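
The ranking step summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `lambda_u_from_tau` and `rank_features` are hypothetical, negative or undefined $\tau$ values are clipped to zero (the Gumbel family only covers $\tau \ge 0$), and SciPy's default tau-b is used to handle ties with the binary label. Because Kendall's $\tau$ depends only on ranks, computing it on the raw columns gives the same value as computing it on pseudo-observations.

```python
import numpy as np
from scipy.stats import kendalltau


def lambda_u_from_tau(tau):
    """Map Kendall's tau to the Gumbel upper-tail dependence coefficient."""
    # Assumption: clip tau into [0, 1) because the Gumbel copula needs theta >= 1.
    tau = min(max(float(tau), 0.0), 1.0 - 1e-12)
    theta = 1.0 / (1.0 - tau)          # Gumbel: tau = 1 - 1/theta
    return 2.0 - 2.0 ** (1.0 / theta)  # lambda_U = 2 - 2^(1/theta)


def rank_features(X, y, names):
    """Return (name, lambda_U) pairs sorted from strongest to weakest tail score."""
    scores = []
    for j, name in enumerate(names):
        tau, _ = kendalltau(X[:, j], y)  # tau-b, rank-based
        if np.isnan(tau):                # e.g. a constant column
            tau = 0.0
        scores.append((name, lambda_u_from_tau(tau)))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

A selector built on this sketch would keep the top-ranked features by some cutoff; how the paper chooses that cutoff is not stated in this summary, so it is left out here.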

Paper Structure

This paper contains 28 sections, 16 equations, 6 figures, 12 tables, 2 algorithms.

Figures (6)

  • Figure 1: ROC curves for GB across feature sets on CDC (AUCs in legend); Gumbel–$\lambda_U$ closely tracks the top performers.
  • Figure 2: CDC: top-10 Gumbel $\lambda_U$ with 95% bootstrap confidence intervals ($B{=}1000$).
  • Figure 3: Permutation importance (mean $\Delta$ROC-AUC over 500 permutations) for Gradient Boosting on the Gumbel top-10 features (CDC test set). Larger values indicate greater importance.
  • Figure 4: Superimposed ROC curves for Random Forest on PIMA across feature sets. Gumbel-$\lambda_U$ yields the top AUC (0.867); all others are close, underscoring minimal sensitivity to feature set size on this 8-variable benchmark.
  • Figure 5: PIMA: Top-8 $\lambda_U$ with 95% bootstrap CIs ($B{=}1000$).
  • ...and 1 more figures
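
The figure captions mention 95% bootstrap confidence intervals for $\lambda_U$ with $B = 1000$ resamples. One plausible way to obtain such intervals is a percentile bootstrap over rows, sketched below. The percentile method, the helper names `lambda_u` and `bootstrap_lambda_u_ci`, and the clipping of $\tau$ to $[0, 1)$ are assumptions made for illustration; the summary does not say which bootstrap variant the paper uses.

```python
import numpy as np
from scipy.stats import kendalltau


def lambda_u(x, y):
    """Gumbel upper-tail dependence implied by Kendall's tau of (x, y)."""
    tau, _ = kendalltau(x, y)
    # Assumption: undefined tau (constant resample) and negative tau count as 0.
    tau = 0.0 if np.isnan(tau) else float(np.clip(tau, 0.0, 1.0 - 1e-12))
    return 2.0 - 2.0 ** (1.0 - tau)  # same as 2 - 2^(1/theta), theta = 1/(1 - tau)


def bootstrap_lambda_u_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap interval for lambda_U."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        stats[b] = lambda_u(x[idx], y[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lambda_u(x, y), float(lo), float(hi)
```

On a dataset the size of CDC, each Kendall's $\tau$ evaluation is itself costly, so a real run would likely subsample or parallelize the resampling loop.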