Robust Classification of High-Dimensional Data using Data-Adaptive Energy Distance
Jyotishka Ray Choudhury, Aytijhya Saha, Sarbojit Roy, Subhajit Dutta
TL;DR
The paper tackles binary classification of high-dimensional, low-sample-size (HDLSS) data by developing robust classifiers, free of tuning parameters, based on a data-adaptive energy distance $\mathcal{W}^{*}_{\mathbf{FG}}$ and its marginals. It introduces three refinements, $({\overline{\mathcal{W}}}^{*}_{FG},\, \bar{\tau}_{FG},\, \bar{\psi}_{FG})$, each paired with a discriminant ($\mathscr{D}_1,\mathscr{D}_2,\mathscr{D}_3$) to yield the classifiers $\delta_1,\delta_2,\delta_3$; the decisions are driven by the estimators $\hat{T}_{FF},\hat{T}_{FG},\hat{T}_{GG}$. The authors prove that, in the HDLSS asymptotic regime, the misclassification probabilities vanish ($\Delta_1,\Delta_2 \to 0$, and $\Delta_3 \to 0$ under an extra condition), and they compare the three classifiers under various separation regimes. Empirically, the methods outperform standard approaches to HDLSS classification, including GLMNET, SVM, NN-RP, and 1-NN, on simulations and six real HDLSS datasets, while remaining robust to outliers.
Abstract
Classification of high-dimensional low sample size (HDLSS) data poses a challenge in a variety of real-world situations, such as gene expression studies, cancer research, and medical imaging. This article develops and analyzes classifiers that are specifically designed for HDLSS data. These classifiers are free of tuning parameters and are robust, in the sense that they require no moment conditions on the underlying data distributions. It is shown that they yield perfect classification in the HDLSS asymptotic regime, under some fairly general conditions. The comparative performance of the proposed classifiers is also investigated. Our theoretical results are supported by extensive simulation studies and real data analysis, which demonstrate promising advantages of the proposed classification techniques over several widely recognized methods.
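
To give a rough sense of how a distance-based discriminant of this kind operates, the sketch below implements a classifier built on the classical Euclidean energy distance, $2\,E\|X-Y\| - E\|X-X'\| - E\|Y-Y'\|$. This is not the data-adaptive $\mathcal{W}^{*}_{\mathbf{FG}}$ of the paper, whose definition is not reproduced here, and the classical version needs finite first moments, so it does not share the robustness described above. The function names `mean_pairwise_distance` and `energy_classify` are hypothetical and chosen only for illustration.

```python
import numpy as np


def mean_pairwise_distance(a, b, same=False):
    """Average Euclidean distance between the rows of a and the rows of b.

    With same=True, a and b are taken to be the same sample and the zero
    self-distances on the diagonal are excluded from the average.
    """
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    if same:
        n = a.shape[0]
        return d.sum() / (n * (n - 1))
    return d.mean()


def energy_classify(z, x, y):
    """Assign the point z to class 1 (sample x) or class 2 (sample y).

    Illustrative assumption: classical Euclidean energy distance, not the
    data-adaptive dissimilarity used in the paper.
    """
    t_xx = mean_pairwise_distance(x, x, same=True)  # within-class spread of x
    t_yy = mean_pairwise_distance(y, y, same=True)  # within-class spread of y
    t_xz = mean_pairwise_distance(z[None, :], x)    # average distance from z to x
    t_yz = mean_pairwise_distance(z[None, :], y)    # average distance from z to y
    # If z comes from the first class, the expected value of
    # (t_yz - t_yy / 2) - (t_xz - t_xx / 2) is half the energy distance,
    # which is non-negative, so the smaller score indicates the class.
    score_x = t_xz - 0.5 * t_xx
    score_y = t_yz - 0.5 * t_yy
    return 1 if score_x <= score_y else 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = 200                                   # dimension much larger than sample size
    x = rng.normal(0.0, 1.0, size=(10, p))    # class 1 training sample
    y = rng.normal(1.0, 1.0, size=(10, p))    # class 2 training sample
    z = rng.normal(0.0, 1.0, size=p)          # test point drawn from class 1
    print(energy_classify(z, x, y))
```

Like the classifiers summarized above, this rule has no tuning parameters: it compares only averages of within-class and point-to-class distances, which loosely mirrors the role played by $\hat{T}_{FF}$, $\hat{T}_{FG}$, and $\hat{T}_{GG}$.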
