A Copula Based Supervised Filter for Feature Selection in Diabetes Risk Prediction Using Machine Learning
Agnideep Aich, Md Monzur Murshed, Sameera Hewage, Amanda Mayeaux
TL;DR
The paper introduces a supervised filter for feature selection based on the Gumbel copula upper-tail dependence coefficient $\lambda_U$, which identifies predictors whose extreme values co-occur with the positive (diabetes) class. Variables are converted to pseudo-observations, Kendall's $\tau$ is mapped to the Gumbel parameter $\theta$ and then to $\lambda_U$, and features are ranked by $\lambda_U$; bootstrap confidence intervals are used to validate the tail signal. Evaluations on the large CDC health indicators dataset (N=253,680) and the clinical PIMA dataset (N=768) show that the Gumbel-$\lambda_U$ selector is fast, reduces dimensionality (by about 52% on CDC), and yields competitive or superior ROC-AUC compared with MI, mRMR, and ReliefF. On PIMA, where the method is used for ranking only, no significant AUC differences were found. The approach yields clinically coherent top predictors and is reported to be robust to noise and missing data, offering a practical, interpretable tool for public health screening and clinical risk modeling. The work also outlines extensions to other tail-copula families, potential applications across omics and imaging domains, and considerations for deployment, calibration, and fairness in real-world settings.
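The scoring chain summarized above can be sketched in a few lines. For the Gumbel copula, the standard relations are $\theta = 1/(1-\tau)$ and $\lambda_U = 2 - 2^{1/\theta}$. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy data, the use of tie-corrected Kendall's $\tau_b$, and the clipping of negative $\tau$ to zero (the Gumbel family requires $\theta \ge 1$) are all choices made here for illustration. Because Kendall's $\tau$ depends only on ranks, the pseudo-observation step does not change the score; it is included only to mirror the described pipeline.

```python
import math
from itertools import combinations


def pseudo_observations(values):
    """Rescale values to (0, 1) using average ranks divided by n + 1."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of the 1-based ranks in the tie block
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return [r / (n + 1) for r in ranks]


def kendall_tau_b(u, v):
    """Tie-corrected Kendall's tau-b over all pairs (O(n^2), fine for small n)."""
    concordant = discordant = tied_u = tied_v = 0
    for (u1, v1), (u2, v2) in combinations(zip(u, v), 2):
        du, dv = u1 - u2, v1 - v2
        if du == 0:
            tied_u += 1
        if dv == 0:
            tied_v += 1
        if du * dv > 0:
            concordant += 1
        elif du * dv < 0:
            discordant += 1
    n_pairs = len(u) * (len(u) - 1) // 2
    denom = math.sqrt((n_pairs - tied_u) * (n_pairs - tied_v))
    return (concordant - discordant) / denom if denom > 0 else 0.0


def gumbel_lambda_upper(tau):
    """Map Kendall's tau to the Gumbel upper-tail dependence coefficient."""
    tau = max(tau, 0.0)  # assumption: negative association is scored as no tail dependence
    if tau >= 1.0:
        return 1.0  # limiting case theta -> infinity
    theta = 1.0 / (1.0 - tau)
    return 2.0 - 2.0 ** (1.0 / theta)


def rank_features(features, labels):
    """Return (name, lambda_U) pairs sorted from strongest to weakest tail signal."""
    v = pseudo_observations(labels)
    scores = {
        name: gumbel_lambda_upper(kendall_tau_b(pseudo_observations(column), v))
        for name, column in features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data with hypothetical feature names; not taken from the CDC or PIMA datasets.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f_strong": [1, 2, 3, 4, 5, 6, 7, 8],
    "f_weak": [5, 1, 7, 3, 2, 8, 4, 6],
    "f_negative": [8, 7, 6, 5, 4, 3, 2, 1],
}
ranking = rank_features(features, labels)
for name, score in ranking:
    print(f"{name}: lambda_U = {score:.3f}")
```

In this sketch, a feature whose large values line up with the positive class receives a high $\lambda_U$, while a negatively associated feature is scored at zero. For datasets the size of CDC, the quadratic pairwise loop would need to be replaced by an $O(n \log n)$ Kendall's $\tau$ routine.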
Abstract
Effective feature selection is vital for robust and interpretable medical prediction, especially for identifying risk factors concentrated in extreme patient strata. Standard methods emphasize average associations and may miss predictors whose importance lies in the tails of the distribution. We propose a computationally efficient supervised filter that ranks features using the Gumbel copula upper-tail dependence coefficient ($\lambda_U$), prioritizing variables that are simultaneously extreme with the positive class. We benchmarked it against Mutual Information, mRMR, ReliefF, and $L_1$ Elastic Net across four classifiers on two diabetes datasets: a large public health survey (CDC, N=253,680) and a clinical benchmark (PIMA, N=768). Evaluation included paired statistical tests, permutation importance, and robustness checks with label flips, feature noise, and missingness. On CDC, our method was the fastest selector and reduced the feature space by about 52% while retaining strong discrimination. Although using all 21 features yielded the highest ROC-AUC, our filter significantly outperformed Mutual Information and mRMR and was statistically indistinguishable from ReliefF. On PIMA, with only eight predictors, our ranking produced the numerically highest ROC-AUC, and no significant differences were found versus strong baselines. Across both datasets, the upper-tail criterion consistently identified clinically coherent, impactful predictors. We conclude that copula-based feature selection via upper-tail dependence is a powerful, efficient, and interpretable approach for building risk models in public health and clinical medicine.
