Robust Classification of High-Dimensional Data using Data-Adaptive Energy Distance

Jyotishka Ray Choudhury, Aytijhya Saha, Sarbojit Roy, Subhajit Dutta

TL;DR

The paper tackles binary classification of high-dimensional low sample size (HDLSS) data by developing tuning-parameter-free, robust classifiers based on a data-adaptive energy distance $\mathcal{W}^{*}_{\mathbf{FG}}$ and its marginals. It introduces three refinements, $({\overline{\mathcal{W}}}^{*}_{FG},\, \bar{\tau}_{FG},\, \bar{\psi}_{FG})$, each paired with a discriminant ($\mathscr{D}_1,\mathscr{D}_2,\mathscr{D}_3$) to yield the classifiers $\delta_1,\delta_2,\delta_3$, with the estimators $\hat{T}_{FF},\hat{T}_{FG},\hat{T}_{GG}$ driving the decisions. The authors prove that the misclassification probabilities vanish in the HDLSS asymptotic regime (i.e., $\Delta_1,\Delta_2 \to 0$, and $\Delta_3 \to 0$ under an extra condition) and compare the classifiers under various separation regimes. Empirically, the methods outperform standard HDLSS approaches, including GLMNET, SVM, NN-RP, and 1-NN, on simulations and six real HDLSS datasets, while remaining robust to outliers and free of tuning parameters.

Abstract

Classification of high-dimensional low sample size (HDLSS) data poses a challenge in a variety of real-world situations, such as gene expression studies, cancer research, and medical imaging. This article presents the development and analysis of some classifiers that are specifically designed for HDLSS data. These classifiers are free of tuning parameters and are robust, in the sense that they are devoid of any moment conditions of the underlying data distributions. It is shown that they yield perfect classification in the HDLSS asymptotic regime, under some fairly general conditions. The comparative performance of the proposed classifiers is also investigated. Our theoretical results are supported by extensive simulation studies and real data analysis, which demonstrate promising advantages of the proposed classification techniques over several widely recognized methods.
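
To make the general idea of an energy-distance-type classifier concrete, here is a minimal sketch in Python. It uses the classical energy distance with Euclidean norms, not the data-adaptive $\mathcal{W}^{*}_{\mathbf{FG}}$ developed in the paper, so it relies on distances the paper is designed to avoid and inherits no robustness guarantees. All function names (`euclid`, `mean_dist_to_sample`, `mean_within_dist`, `classify`) and the simulated Gaussian data are assumptions made for illustration only.

```python
import math
import random


def euclid(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_dist_to_sample(z, sample):
    """Average distance from a test point z to every point in a sample."""
    return sum(euclid(z, x) for x in sample) / len(sample)


def mean_within_dist(sample):
    """Average pairwise distance within a sample (distinct pairs only)."""
    n = len(sample)
    total = sum(
        euclid(sample[i], sample[j]) for i in range(n) for j in range(i + 1, n)
    )
    return 2.0 * total / (n * (n - 1))


def classify(z, sample_f, sample_g):
    """Assign z to class 1 (F) or class 2 (G).

    Illustrative rule based on the classical energy distance: compare
    mean distance to each class, corrected by half the within-class spread.
    This is NOT the paper's discriminant.
    """
    score_f = mean_dist_to_sample(z, sample_f) - 0.5 * mean_within_dist(sample_f)
    score_g = mean_dist_to_sample(z, sample_g) - 0.5 * mean_within_dist(sample_g)
    return 1 if score_f <= score_g else 2


# Hypothetical HDLSS setting: dimension 200, only 10 observations per class.
rng = random.Random(0)
d, n = 200, 10
sample_f = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
sample_g = [[rng.gauss(1.0, 1.0) for _ in range(d)] for _ in range(n)]

z_from_f = [rng.gauss(0.0, 1.0) for _ in range(d)]
z_from_g = [rng.gauss(1.0, 1.0) for _ in range(d)]
print(classify(z_from_f, sample_f, sample_g), classify(z_from_g, sample_f, sample_g))
```

The rule involves no tuning parameter, which mirrors one property claimed for the proposed classifiers. The Euclidean norm, however, implicitly depends on moments of the underlying distributions, which is precisely the limitation the paper's data-adaptive construction is meant to remove.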

Paper Structure (20 sections, 16 theorems, 78 equations, 2 figures, 7 tables)

Key Result

Theorem 2.1

Suppose assumptions ass:0.1, ass:0.2, and ass:0.3 are satisfied. Then $\theta^{*}_{\mathbf{FG}}=\lim_{d \to \infty} \mathcal{W}^{*}_{\mathbf{FG}}$ is finite, and for a test observation $\mathbf{Z}$,

Figures (2)

  • Figure 1: Average misclassification rates with error bars for $\delta_0$, along with some popular classifiers, for increasing dimensions. The Bayes classifier is treated as a benchmark.
  • Figure 2: Average misclassification rates with error bars for $\delta_1$, $\delta_2$, and $\delta_3$, along with some popular classifiers, for different dimensions. The Bayes classifier (assuming that the competing distributions in all examples are known) is treated as a benchmark.

Theorems & Definitions (16)

  • Theorem 2.1
  • Theorem 2.2
  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 3.4
  • Lemma A.1
  • Lemma A.2
  • Lemma A.3
  • Lemma A.4
  • ...and 6 more