Robust Universum Twin Support Vector Machine for Imbalanced Data
M. Tanveer, A. Quadir
TL;DR
The paper tackles imbalanced classification in the presence of noise and outliers by introducing IFUTSVM-ID, a robust Universum Twin SVM variant that combines intuitionistic fuzzy memberships with universum data. It provides both linear and nonlinear (kernel) formulations, adds a regularization term to enforce structural risk minimization, and balances the classes through undersampling and oversampling via universum data. Experiments on 46 KEEL datasets and the ADNI cohort show higher accuracy than the baselines, with statistically significant differences, and resilience to label noise. The work offers practical gains for real-world classification tasks, including biomedical settings such as Alzheimer's disease diagnosis, and points to future improvements in universum data selection and sampling strategies.
Abstract
Classifying imbalanced datasets is one of the major difficulties in machine learning. Imbalance may lead to biased models in which the training process is dominated by the majority class, resulting in inadequate representation of the minority class. The universum twin support vector machine (UTSVM) produces a model biased towards the majority class; as a result, its performance on the minority class is often poor, since minority samples may be mistakenly treated as noise. Moreover, UTSVM is not proficient in handling datasets that contain outliers and noise. Inspired by the concept of incorporating prior information about the data and employing an intuitionistic fuzzy membership scheme, we propose the intuitionistic fuzzy UTSVM for imbalanced data (IFUTSVM-ID), which enhances overall robustness. We use an intuitionistic fuzzy membership scheme to mitigate the impact of noise and outliers. To tackle the imbalanced class distribution, data oversampling and undersampling methods are utilized. Universum data provide prior knowledge about the data, which leads to better generalization performance. UTSVM is susceptible to overfitting because its primal formulations omit the structural risk minimization (SRM) principle. In contrast, the proposed IFUTSVM-ID model implements the SRM principle through regularization terms, effectively addressing the issue of overfitting. We conduct a comprehensive evaluation of the proposed IFUTSVM-ID model on benchmark datasets from KEEL and compare it with existing baseline models. Furthermore, to assess the effectiveness of the proposed IFUTSVM-ID model in diagnosing Alzheimer's disease (AD), we apply it to the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. Experimental results showcase the superiority of the proposed IFUTSVM-ID model over the baseline models.
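To make the idea of an intuitionistic fuzzy membership scheme concrete, the following is a minimal sketch in Python. The abstract does not state the exact formulas, so everything here is an assumption: the sketch follows a scheme commonly used in the intuitionistic fuzzy SVM literature, in which membership decreases with distance from the class center, non-membership grows with the share of opposite-class neighbors, and the two are combined into a single score. The function name `intuitionistic_fuzzy_scores` and the parameters `alpha` (neighborhood radius) and `delta` (small positive constant) are hypothetical, and the distances are computed in input space rather than in a kernel-induced feature space.

```python
import numpy as np


def intuitionistic_fuzzy_scores(X, y, alpha=1.0, delta=1e-4):
    """Return one score in [0, 1] per sample (illustrative scheme, not the paper's)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Membership: 1 - (distance to own class center) / (class radius + delta).
    mu = np.empty(X.shape[0])
    for label in np.unique(y):
        mask = y == label
        center = X[mask].mean(axis=0)
        dist = np.linalg.norm(X[mask] - center, axis=1)
        mu[mask] = 1.0 - dist / (dist.max() + delta)

    # Share of opposite-class samples within radius alpha of each sample.
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = pairwise <= alpha            # each sample counts itself
    opposite = y[:, None] != y[None, :]
    rho = (neighbours & opposite).sum(axis=1) / neighbours.sum(axis=1)

    # Non-membership: large when a sample is far from its center AND
    # surrounded by samples of the other class.
    nu = (1.0 - mu) * rho

    # Score: mu if nu == 0, 0 if mu <= nu, else (1 - nu) / (2 - mu - nu).
    combined = (1.0 - nu) / (2.0 - mu - nu)
    return np.where(nu == 0, mu, np.where(mu <= nu, 0.0, combined))


# Toy data: the fourth +1 sample lies inside the -1 cluster (a likely outlier).
X_demo = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [5.0, 5.0],
                   [5.0, 5.2], [5.2, 5.0], [4.8, 5.0]])
y_demo = np.array([1, 1, 1, 1, -1, -1, -1])
scores_demo = intuitionistic_fuzzy_scores(X_demo, y_demo)
print(scores_demo)
```

In this toy example the mislabeled-looking sample receives a score of zero, so a classifier that weights each sample's error term by its score would effectively ignore it. How IFUTSVM-ID actually uses the scores in its optimization problems is specified in the paper body, not in this sketch.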
