Robust Classification of High-Dimensional Data using Data-Adaptive Energy Distance

Jyotishka Ray Choudhury; Aytijhya Saha; Sarbojit Roy; Subhajit Dutta

TL;DR

The paper tackles binary classification of high-dimensional low sample size (HDLSS) data by developing tuning-parameter-free, robust classifiers based on a data-adaptive energy distance $\mathcal{W}^{*}_{\mathbf{FG}}$ and its marginals. It introduces three refinements, $\overline{\mathcal{W}}^{*}_{\mathbf{FG}}$, $\bar{\tau}_{\mathbf{FG}}$, and $\bar{\psi}_{\mathbf{FG}}$, each paired with a discriminant ($\mathscr{D}_1$, $\mathscr{D}_2$, $\mathscr{D}_3$) to yield the classifiers $\delta_1$, $\delta_2$, $\delta_3$, with the estimators $\hat{T}_{FF}$, $\hat{T}_{FG}$, $\hat{T}_{GG}$ driving the decisions. The authors prove that, in the HDLSS asymptotic regime, the misclassification probabilities vanish ($\Delta_1, \Delta_2 \to 0$, and $\Delta_3 \to 0$ under an extra condition), and they compare the classifiers under various separation regimes. Empirically, the methods outperform several widely recognized classifiers, including GLMNET, SVM, NN-RP, and 1-NN, on simulations and six real HDLSS datasets, while remaining robust to outliers and free of tuning parameters.
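
To make the idea of an energy-distance-driven decision rule concrete, the following is a minimal sketch that uses the classical Euclidean energy distance, $2\,\mathrm{E}\|X-Y\| - \mathrm{E}\|X-X'\| - \mathrm{E}\|Y-Y'\|$, as a stand-in. It is not the paper's data-adaptive distance $\mathcal{W}^{*}_{\mathbf{FG}}$, and unlike the proposed classifiers it relies on Euclidean norms and so carries no robustness guarantee. The function names (`mean_pairwise_distance`, `energy_distance`, `classify`) and the exact form of the discriminant are illustrative assumptions, loosely mirroring the roles that the within-class and between-class averages play in the summary above.

```python
import numpy as np


def mean_pairwise_distance(a, b=None):
    """Average Euclidean distance between the rows of a and the rows of b.

    If b is omitted, the average is taken over distinct pairs of rows of a
    (a within-sample average, analogous in role to T_FF or T_GG).
    """
    if b is None:
        n = a.shape[0]
        dists = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=2)
        # The diagonal is zero, so dividing by n(n-1) averages over i != j.
        return dists.sum() / (n * (n - 1))
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return dists.mean()


def energy_distance(x, y):
    """Sample version of the classical (Euclidean) energy distance."""
    return (2.0 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x)
            - mean_pairwise_distance(y))


def classify(z, x, y):
    """Assign z to class 1 (sample x) or class 2 (sample y).

    Hypothetical discriminant: average distance from z to a class, corrected
    by half of that class's within-sample average distance. For z drawn from
    the first class, the expected difference l_g - l_f equals half of the
    classical energy distance, which is non-negative.
    """
    l_f = np.linalg.norm(x - z, axis=1).mean() - 0.5 * mean_pairwise_distance(x)
    l_g = np.linalg.norm(y - z, axis=1).mean() - 0.5 * mean_pairwise_distance(y)
    return 1 if l_g - l_f > 0 else 2
```

Here `x` and `y` are assumed to be arrays of shape `(n, d)` and `(m, d)` holding the two training samples, and `z` is a length-`d` test observation. The rule has no tuning parameter, which is the one property it shares with the classifiers described above.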

Abstract

Classification of high-dimensional low sample size (HDLSS) data poses a challenge in a variety of real-world situations, such as gene expression studies, cancer research, and medical imaging. This article presents the development and analysis of some classifiers that are specifically designed for HDLSS data. These classifiers are free of tuning parameters and are robust, in the sense that they require no moment conditions on the underlying data distributions. It is shown that they yield perfect classification in the HDLSS asymptotic regime, under some fairly general conditions. The comparative performance of the proposed classifiers is also investigated. Our theoretical results are supported by extensive simulation studies and real data analysis, which demonstrate promising advantages of the proposed classification techniques over several widely recognized methods.

Paper Structure (20 sections, 16 theorems, 78 equations, 2 figures, 7 tables)

Introduction
Our contributions
Methodology
A classifier based on $\mathcal{W}^{*}_{\mathbf{FG}}$
Refinements of $\delta_0$
A New Measure of Distance
Classifier Based on $\overline{\mathcal{W}}^{*}_{\mathbf{FG}}$
Classifier Based on $\bar{\tau}_{\mathbf{FG}}$
Classifier Based on $\bar{\psi}_{\mathbf{FG}}$
Asymptotics under HDLSS Regime
Misclassification Probabilities of $\delta_{1}, \delta_{2}$, and $\delta_{3}$ in the HDLSS asymptotic regime
Comparison of the classifiers
Empirical Performance and Results
Simulation Studies
Implementation on Real Data
...and 5 more sections

Key Result

Theorem 2.1

Suppose assumptions ass:0.1, ass:0.2, and ass:0.3 are satisfied. Then, $\theta^{*}_{\mathbf{FG}}=\lim_{d \to \infty} \mathcal{W}^{*}_{\mathbf{FG}}$ is finite, and for a test observation $\mathbf{Z}$, ...

Figures (2)

Figure 1: Average misclassification rates with error bars for $\delta_0$, along with some popular classifiers, for increasing dimensions. The Bayes classifier is treated as a benchmark.
Figure 2: Average misclassification rates with error bars for $\delta_1$, $\delta_2$, and $\delta_3$, along with some popular classifiers, for different dimensions. The Bayes classifier (assuming that the competing distributions in all examples are known) is treated as a benchmark.

Theorems & Definitions (16)

Theorem 2.1
Theorem 2.2
Theorem 3.1
Theorem 3.2
Theorem 3.3
Theorem 3.4
Lemma A.1
Lemma A.2
Lemma A.3
Lemma A.4
...and 6 more
