Statistical Inference in Classification of High-dimensional Gaussian Mixture

Hanwen Huang, Peng Zeng

TL;DR

This work investigates the asymptotic behavior of a broad class of regularized convex classifiers in the limit where both the sample size n and the dimension p approach infinity while their ratio α=n/p remains fixed.

Abstract

We consider the classification problem of a high-dimensional mixture of two Gaussians with general covariance matrices. Using the replica method from statistical physics, we investigate the asymptotic behavior of a general class of regularized convex classifiers in the high-dimensional limit, where both the sample size $n$ and the dimension $p$ approach infinity while their ratio $\alpha=n/p$ remains fixed. Our focus is on the generalization error and variable selection properties of the estimators. Specifically, based on the distributional limit of the classifier, we construct a de-biased estimator to perform variable selection through an appropriate hypothesis testing procedure. Using $L_1$-regularized logistic regression as an example, we conduct extensive computational experiments to confirm that our analytical findings are consistent with numerical simulations in finite-sized systems. We also explore the influence of the covariance structure on the performance of the de-biased estimator.
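The setting described in the abstract can be illustrated with a small finite-size simulation. The following is a minimal sketch, not the authors' code: it assumes an identity covariance (the paper allows general covariance matrices), equal class probabilities, and a plain proximal-gradient solver for the $L_1$-regularized logistic loss. The function names, the mean vector `mu`, the penalty `lam`, and the sizes `n` and `p` are all hypothetical choices made for illustration.

```python
import math
import random


def simulate_mixture(n, p, mu, rng):
    """Draw n points with labels y = +/-1 (equal odds) and x = y * mu + z, z ~ N(0, I_p)."""
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else -1
        X.append([label * mu[j] + rng.gauss(0.0, 1.0) for j in range(p)])
        y.append(label)
    return X, y


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit_l1_logistic(X, y, lam, step=0.1, iters=200):
    """Minimize mean log(1 + exp(-y * <w, x>)) + lam * ||w||_1 by proximal gradient."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            coef = -yi * sigmoid(-margin) / n  # per-sample gradient weight of the logistic loss
            for j in range(p):
                grad[j] += coef * xi[j]
        # Gradient step on the smooth loss, then soft-thresholding for the L1 penalty.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w


def error_rate(w, X, y):
    """Fraction of points whose predicted sign disagrees with the label."""
    wrong = sum(1 for xi, yi in zip(X, y) if yi * sum(a * b for a, b in zip(w, xi)) <= 0)
    return wrong / len(y)


rng = random.Random(0)
p, n = 20, 200                       # alpha = n / p = 10 (illustrative values only)
mu = [1.5, 1.5] + [0.0] * (p - 2)    # sparse mean: only two informative features (assumed)
X_train, y_train = simulate_mixture(n, p, mu, rng)
X_test, y_test = simulate_mixture(2000, p, mu, rng)
w_hat = fit_l1_logistic(X_train, y_train, lam=0.1)
print("nonzero coefficients:", sum(1 for wj in w_hat if wj != 0.0))
print("test error:", error_rate(w_hat, X_test, y_test))
```

The held-out error printed at the end is a finite-sample stand-in for the generalization error studied in the paper, and the count of nonzero coefficients is a crude view of variable selection. The de-biasing step and the hypothesis test are not reproduced here because their formulas are not given in this summary.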

Paper Structure

This paper contains 7 sections, 2 theorems, 56 equations, 4 figures.

Key Result

Proposition 1

Define two random vectors where $\hat{\boldsymbol{\mu}} = \boldsymbol{\mu}/\|\boldsymbol{\mu}\|$, $\hat{{\bf w}}$ is the minimizer of (class), ${\bf z}\sim N(0,{\bf I}_{p\times p})$, $\tau=\sqrt{\zeta_0}/\zeta$, and $\zeta_0,\zeta,R_0$ can be solved from the following set of nonlinear equations: where the first three expectations are with respect to $\epsilon\sim N(0,1)$, the last three expectat

Figures (4)

  • Figure 1: Comparison between theoretical and empirical precision rates for four different correlation structures: IID (top-left), block (top-right), AR1 (bottom-left), and banded (bottom-right). In each plot, the three lines are the theoretical precision rates at different sparsity levels $\epsilon = 0.01, 0.05, 0.1$. The error bars are the 95% confidence intervals of the mean precision rates based on 500 replicates.
  • Figure 2: Histograms of the components of the penalized logistic regression (PLR) estimator $\hat{{\bf w}}$ (left) and the corresponding de-biased estimator $\bar{{\bf w}}$ (right) for a typical dataset. In the right plot, the curves represent the asymptotic normal densities for the zero and nonzero components.
  • Figure 3: Boxplots of empirical confidence levels based on 500 replicates for four correlation structures: IID (top-left), block (top-right), AR1 (bottom-left), and banded (bottom-right). In each plot, the horizontal line indicates the nominal confidence level. Different rows of plots represent different sparsity levels, $\epsilon = 0.01, 0.05, 0.1$.
  • Figure 4: Comparison between theoretical and empirical powers of hypothesis testing for four different correlation structures: IID (top-left), block (top-right), AR1 (bottom-left), and banded (bottom-right). In each plot, the three lines are the theoretical powers at different sparsity levels $\epsilon = 0.01, 0.05, 0.1$. The error bars are the 95% confidence intervals of the mean powers based on 500 replicates.
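
The four correlation structures named in the captions (IID, block, AR1, banded) can be written down as covariance matrices. The sketch below uses common textbook conventions, which are assumptions here: the captions do not state the paper's exact definitions or parameter values, so the function names, the within-block correlation `rho`, the block size, and the linear-decay form of the banded matrix are hypothetical.

```python
def iid_cov(p):
    """Identity covariance: independent features with unit variance."""
    return [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]


def block_cov(p, block_size, rho):
    """Equicorrelated blocks: correlation rho inside a block, 0 across blocks."""
    return [
        [1.0 if i == j else (rho if i // block_size == j // block_size else 0.0)
         for j in range(p)]
        for i in range(p)
    ]


def ar1_cov(p, rho):
    """AR(1) covariance: correlation rho ** |i - j| decays geometrically with lag."""
    return [[float(rho) ** abs(i - j) for j in range(p)] for i in range(p)]


def banded_cov(p, bandwidth):
    """Banded covariance (assumed form): linear decay, zero once |i - j| >= bandwidth."""
    return [[max(0.0, 1.0 - abs(i - j) / bandwidth) for j in range(p)] for i in range(p)]


# Small illustrative instances; the sizes and parameters are arbitrary.
for name, sigma in [
    ("IID", iid_cov(4)),
    ("block", block_cov(4, 2, 0.3)),
    ("AR1", ar1_cov(4, 0.5)),
    ("banded", banded_cov(4, 2)),
]:
    print(name)
    for row in sigma:
        print("  ", [round(v, 3) for v in row])
```

Each matrix has a unit diagonal, so the four structures differ only in how correlation spreads across features, which is the factor the figures vary.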

Theorems & Definitions (2)

  • Proposition 1
  • Corollary 1