Boundary-Aware Adversarial Filtering for Reliable Diagnosis under Extreme Class Imbalance
Yanxuan Yu, Michael S. Hughes, Julien Lee, Jiacheng Zhou, Andrew F. Laine
TL;DR
The paper tackles reliable diagnosis under extreme class imbalance, where missing true positives is dangerous and calibration matters. It proposes AF-SMOTE, which synthesizes minority samples via SMOTE-like interpolation and then filters them through adversarial realism and boundary-utility scoring, combining scores as $S(x)=\lambda s_{util}(x)+(1-\lambda)s_{real}(x)$. The authors prove that, under mild assumptions, this filtering yields a monotone improvement of the surrogate $\widetilde{F}_\beta(\theta)$ for $\beta\ge 1$ and does not inflate the Brier score. Empirically, AF-SMOTE improves recall and average precision and achieves the best calibration on MIMIC-IV proxy diagnosis and fraud benchmarks, with robust gains in high-dimensional settings via lightweight PCA pre-processing, demonstrating practical value for clinical and other high-stakes applications.
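For reference, the two metrics named above have the following standard textbook definitions. The paper's surrogate $\widetilde{F}_\beta(\theta)$ is not spelled out in this summary, so only the usual quantities are shown. The symbols $P$ (precision), $R$ (recall), $p_i$ (predicted probability), $y_i \in \{0,1\}$ (true label), and $n$ (sample count) are notation assumed for this illustration.

$$
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}, \qquad \mathrm{Brier} = \frac{1}{n}\sum_{i=1}^{n} (p_i - y_i)^2
$$

With $\beta \ge 1$, recall is weighted at least as heavily as precision, which matches the emphasis on not missing true positives. A lower Brier score indicates better-calibrated probabilities.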
Abstract
We study classification under extreme class imbalance where recall and calibration are both critical, as in medical diagnosis. We propose AF-SMOTE, a mathematically motivated augmentation framework that first synthesizes minority points and then filters them with an adversarial discriminator and a boundary-utility model. We prove that, under mild assumptions on decision-boundary smoothness and the class-conditional densities, the filtering step monotonically improves a surrogate of $F_\beta$ (for $\beta \ge 1$) without inflating the Brier score. On MIMIC-IV proxy-label prediction and canonical fraud-detection benchmarks, AF-SMOTE attains higher recall and average precision than strong oversampling baselines (SMOTE, ADASYN, Borderline-SMOTE, SVM-SMOTE) and yields the best calibration. We further validate these gains on multiple additional datasets beyond MIMIC-IV. The successful application of AF-SMOTE to a healthcare dataset with a proxy label demonstrates, in a disease-agnostic way, its practical value in clinical settings, where missing true positives for rare diseases can have severe consequences.
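The synthesize-then-filter idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`smote_candidates`, `combined_score`, `filter_candidates`), the `keep_frac` top-fraction selection rule, and the default values are all assumptions made for illustration. The utility and realism scores are taken as given arrays; in the paper they come from a boundary-utility model and an adversarial discriminator, neither of which is reproduced here. Only the combination rule $S(x)=\lambda s_{util}(x)+(1-\lambda)s_{real}(x)$ is taken from the TL;DR.

```python
import numpy as np


def smote_candidates(X_min, n_new, k=5, rng=None):
    """SMOTE-like interpolation between minority points and their k nearest
    minority neighbours (brute-force distances; fine for small examples)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise squared Euclidean distances; exclude each point from its own neighbours.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    neighbours = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                   # seed point per candidate
    nbr = neighbours[base, rng.integers(0, k, size=n_new)]  # one random neighbour each
    t = rng.random((n_new, 1))                              # interpolation weight in [0, 1)
    return X_min[base] + t * (X_min[nbr] - X_min[base])


def combined_score(s_util, s_real, lam=0.5):
    """S(x) = lam * s_util(x) + (1 - lam) * s_real(x)."""
    return lam * np.asarray(s_util) + (1.0 - lam) * np.asarray(s_real)


def filter_candidates(X_cand, s_util, s_real, lam=0.5, keep_frac=0.5):
    """Keep the top `keep_frac` of candidates by combined score.
    ASSUMPTION: top-fraction selection is a stand-in for the paper's filter rule."""
    scores = combined_score(s_util, s_real, lam)
    n_keep = max(1, int(round(keep_frac * len(X_cand))))
    order = np.argsort(-scores)[:n_keep]  # indices sorted by descending score
    return X_cand[order], scores[order]


# Toy usage with random placeholder scores (hypothetical, not real model outputs).
demo_rng = np.random.default_rng(0)
X_minority = demo_rng.normal(size=(30, 4))
candidates = smote_candidates(X_minority, n_new=100, k=5, rng=1)
s_util_demo = demo_rng.random(100)  # placeholder for boundary-utility scores
s_real_demo = demo_rng.random(100)  # placeholder for discriminator realism scores
kept, kept_scores = filter_candidates(candidates, s_util_demo, s_real_demo, lam=0.7)
```

In this sketch, a larger `lam` favours candidates judged useful near the decision boundary, while a smaller `lam` favours candidates judged realistic. The retained points would then be appended to the minority class before training.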
