Minimax-Optimal Dimension-Reduced Clustering for High-Dimensional Nonspherical Mixtures

Chengzhu Huang, Yuqi Gu

TL;DR

The paper tackles clustering under high-dimensional nonspherical (anisotropic) Gaussian mixtures and reveals an information-theoretic dimension-reduction phenomenon: the minimax clustering risk depends only on projections of centers and covariances onto the subspace spanned by the cluster centers. It introduces Covariance Projected Spectral Clustering (COPO), which projects data onto the top-$K$ singular subspace and uses projected covariances to refine clustering, achieving minimax-optimal rates in Gaussian settings and adapting to broad dependent noise structures via universality results. The authors establish a new minimax lower bound based on projected SNR, derive universal upper bounds for COPO under Gaussian and non-Gaussian noise with local dependence, and validate performance through extensive simulations and a HapMap3 real-data analysis. Together, these results give a subspace-aware procedure for high-dimensional heteroskedastic mixtures that is computationally efficient, minimax-optimal in the Gaussian case, and applicable to non-Gaussian noise with local dependence.

Abstract

In mixture models, nonspherical (anisotropic) noise within each cluster is widely present in real-world data. We study both the minimax rate and optimal statistical procedure for clustering under high-dimensional nonspherical mixture models. In high-dimensional settings, we first establish the information-theoretic limits for clustering under Gaussian mixtures. The minimax lower bound unveils an intriguing informational dimension-reduction phenomenon: there exists a substantial gap between the minimax rate and the oracle clustering risk, with the former determined solely by the projected centers and projected covariance matrices in a low-dimensional space. Motivated by the lower bound, we propose a novel computationally efficient clustering method: Covariance Projected Spectral Clustering (COPO). Its key step is to project the high-dimensional data onto the low-dimensional space spanned by the cluster centers and then use the projected covariance matrices in this space to enhance clustering. We establish tight algorithmic upper bounds for COPO, both for Gaussian noise with flexible covariance and general noise with local dependence. Our theory indicates the minimax-optimality of COPO in the Gaussian case and highlights its adaptivity to a broad spectrum of dependent noise. Extensive simulation studies under various noise structures and real data analysis demonstrate our method's superior performance.
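
The two-stage idea described in the abstract (project onto a low-dimensional subspace, then reassign points using covariances estimated in that subspace) can be illustrated with a minimal NumPy sketch. This is not the authors' COPO algorithm as specified in the paper. The function names (`project_top_k`, `lloyd_kmeans`, `refine_with_projected_covariances`, `copo_sketch`), the plain Lloyd's k-means initialization, the Mahalanobis-only reassignment rule (no log-determinant term), the ridge term, and the fixed three refinement passes are all assumptions made for illustration.

```python
import numpy as np


def project_top_k(Y, K):
    """Project the n x p data matrix Y onto its top-K right singular subspace."""
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    V = Vt[:K].T              # p x K orthonormal basis
    return Y @ V              # n x K projected data


def lloyd_kmeans(Z, K, n_iter=50, seed=0):
    """Plain Lloyd's k-means, used here only as a simple initializer (assumption)."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=K, replace=False)]
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        sq_dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = Z[labels == k].mean(axis=0)
    return labels


def refine_with_projected_covariances(Z, labels, K, n_iter=3, ridge=1e-8):
    """Reassign each point to the cluster with the smallest Mahalanobis distance,
    using per-cluster means and covariances estimated in the projected space.
    The exact rule in the paper may differ; this is a hypothetical variant."""
    n, d = Z.shape
    for _ in range(n_iter):
        scores = np.full((n, K), np.inf)
        for k in range(K):
            Zk = Z[labels == k]
            if Zk.shape[0] <= d:      # too few points to estimate a covariance
                continue
            mu = Zk.mean(axis=0)
            S = np.cov(Zk, rowvar=False) + ridge * np.eye(d)
            diff = Z - mu
            scores[:, k] = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)
        labels = scores.argmin(axis=1)
    return labels


def copo_sketch(Y, K, n_refine=3, seed=0):
    """Projection, k-means initialization, then covariance-aware refinement."""
    Z = project_top_k(Y, K)
    labels = lloyd_kmeans(Z, K, seed=seed)
    return refine_with_projected_covariances(Z, labels, K, n_iter=n_refine)
```

Calling `copo_sketch(Y, K)` on an `n x p` array returns an integer label per row. Because the refinement compares quadratic forms with cluster-specific covariances, its decision boundaries in the projected space are conic sections rather than straight lines, which is the qualitative behavior the Figure 1 caption attributes to COPO.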

Paper Structure

This paper contains 62 sections, 39 theorems, 334 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

If $n = o(p)$, $\mathsf{SNR}_0\rightarrow \infty$, and consider a broad class of parameters $\mathbf \Theta_\alpha$ where $\mathsf{SNR}(\{\boldsymbol \theta_j^*\}_{j\in[K]}, \{\mathbf \Sigma_j\}_{j\in[K]}) \geq \mathsf{SNR}_0$ and $\exp(-{\mathsf{SNR}_0^2}/{2})$ is much larger than the Bayesian o

Figures (6)

  • Figure 1: Comparing spectral clustering [loffler2021optimality] and COPO in the top-2 right singular subspace of $\mathbf Y_{n\times p}$, with $n = 500$ and $p = 1000$. From left to right: spectral clustering, then the first, second, and third iterations of COPO. "Err." is the clustering error, i.e. the number of misclustered points (shown in light green). Dashed lines are the decision boundaries: straight lines for spectral clustering, and elliptical (Figure \ref{fig: ellipse}) and hyperbolic (Figure \ref{fig: hyperbola}) curves for COPO.
  • Figure 2: Histogram of $(\mathbf U \mathbf R_{\mathbf U} - \mathbf U^*)_{1,1}$ with noise entries obeying different distributions.
  • Figure 3: Clustering error rates with varying dimensions for Ising mixtures, multivariate Probit mixtures, multivariate Gamma mixtures, and multivariate Negative Binomial mixtures.
  • Figure 4: Pair plot of the top right singular vectors for the full data and a subset of the data with two subpopulations of the HapMap3 dataset.
  • Figure 5: Contours and decision boundaries for the subpopulations CEU and MEX of the HapMap3 dataset. The first subfigure shows the decision boundary of spectral clustering, and the second to the fourth ones illustrate the first three steps of the COPO algorithm.
  • ...and 1 more figure

Theorems & Definitions (52)

  • Theorem: Informal Lower Bound; formal versions in Theorem \ref{theorem: gaussian lower bound} and Theorem \ref{theorem: gaussian lower bound with K components}
  • Theorem: Informal Upper Bound; formal version in Theorem \ref{theorem: upper bound for algorithm}
  • Proposition 2.1: Homogeneous covariance matrices
  • Proposition 2.2: Covariance matrices homogeneous in most directions
  • Remark 1
  • Corollary 2.1
  • Theorem 2.3: Minimax Lower Bound for Two-component Gaussian Mixtures
  • Theorem 2.4: Minimax Lower Bound for $K$-component Gaussian Mixtures
  • Proposition 2.5
  • Theorem 4.4
  • ...and 42 more