High-dimensional Clustering and Signal Recovery under Block Signals

Wu Su, Yumou Qiu

TL;DR

This work develops computationally efficient, minimax-optimal approaches for high-dimensional clustering and signal recovery under block-structured means. By separating sparse-block (CFA-PCA) and dense-block (MA-PCA) regimes, it leverages block contiguity to boost statistical power under general sub-Gaussian noise and bandable covariances. The authors establish both statistical and computational minimax lower bounds, revealing phase transitions between impossibility and possibility under polynomial-time constraints, and demonstrate that CFA-PCA and MA-PCA attain these bounds in their respective regimes. Extensions to tensor data and extensive simulations, plus a real-world case study on global temperature changes, validate the practical utility of exploiting block structure in high-dimensional clustering and signal identification.

Abstract

This paper studies computationally efficient methods and their minimax optimality for high-dimensional clustering and signal recovery under block signal structures. We propose two sets of methods, cross-block feature aggregation PCA (CFA-PCA) and moving average PCA (MA-PCA), designed for sparse and dense block signals, respectively. Both methods adaptively utilize block signal structures and are applicable to non-Gaussian data with heterogeneous variances and non-diagonal covariance matrices. Specifically, the CFA method utilizes a block-wise U-statistic to aggregate and select block signals non-parametrically from data with unknown cluster labels. We show that the proposed methods are consistent for both clustering and signal recovery under mild conditions and under weaker signal strengths than existing methods that ignore the block structure of signals. Furthermore, we derive both statistical and computational minimax lower bounds (SMLB and CMLB) for high-dimensional clustering and signal recovery under block signals, where the CMLBs are restricted to algorithms with polynomial computational complexity. The minimax boundaries partition signals into regions of impossibility and possibility. No algorithm (or no polynomial-time algorithm) can achieve consistent clustering or signal recovery if the signals fall into the statistical (or computational) region of impossibility. We show that the proposed CFA-PCA and MA-PCA methods can achieve the CMLBs for the sparse and dense block signal regimes, respectively, indicating that the proposed methods are computationally minimax optimal. A tuning parameter selection method is proposed based on post-clustering signal recovery results. Simulation studies are conducted to evaluate the proposed methods. A case study on global temperature change demonstrates their utility in practice.
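The abstract names MA-PCA but does not spell out its steps, so the following is only a minimal sketch of the general idea the name suggests: average neighbouring features so that a contiguous dense block of signal is reinforced while independent noise is damped, then split the samples by the sign of the leading principal component score. It is not the authors' procedure. The two-cluster setup, the window length `h`, the helper names, and the simulated data are all assumptions made for illustration.

```python
import random

def moving_average(row, h):
    """Average every window of h consecutive coordinates (output length p - h + 1)."""
    window_sum = sum(row[:h])
    out = [window_sum / h]
    for j in range(h, len(row)):
        window_sum += row[j] - row[j - h]
        out.append(window_sum / h)
    return out

def leading_eigvec(G, iters=200, seed=0):
    """Power iteration for the top eigenvector of a symmetric PSD matrix G."""
    rng = random.Random(seed)
    n = len(G)
    # Random start: a constant vector lies in the null space of a centered Gram matrix.
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        w = [sum(G[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def ma_pca_sketch(X, h):
    """Hypothetical two-cluster sketch: smooth, center, cluster by sign of top PC score."""
    S = [moving_average(row, h) for row in X]
    n, q = len(S), len(S[0])
    col_mean = [sum(S[i][j] for i in range(n)) / n for j in range(q)]
    Z = [[S[i][j] - col_mean[j] for j in range(q)] for i in range(n)]
    # Sample-by-sample Gram matrix; its top eigenvector gives the PC scores up to scale.
    G = [[sum(a * b for a, b in zip(Z[i], Z[k])) for k in range(n)] for i in range(n)]
    v = leading_eigvec(G)
    return [1 if x >= 0 else -1 for x in v]

def clustering_error(est, truth):
    """Number of mislabeled samples, minimized over the two label assignments."""
    mismatches = sum(e != t for e, t in zip(est, truth))
    return min(mismatches, len(truth) - mismatches)

def simulate(n=40, p=120, start=30, length=30, amp=1.5, seed=1):
    """Toy data (assumed, not from the paper): means +/- mu with one contiguous block."""
    rng = random.Random(seed)
    labels = [1 if i % 2 == 0 else -1 for i in range(n)]
    mu = [amp if start <= j < start + length else 0.0 for j in range(p)]
    X = [[labels[i] * mu[j] + rng.gauss(0, 1) for j in range(p)] for i in range(n)]
    return X, labels

X, labels = simulate()
est = ma_pca_sketch(X, h=10)
print("mislabeled samples:", clustering_error(est, labels))
```

The error is taken as a minimum over the two label assignments because cluster labels are only identified up to a swap. In this sketch the window length is fixed by hand; the abstract states that the paper instead selects tuning parameters from post-clustering signal recovery results.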

Paper Structure

This paper contains 8 sections, 7 theorems, 19 equations, 4 figures, 2 algorithms.

Key Result

Theorem 1

For the clustering problem under the model in eq:mixgauss, under Assumptions assu:sub_gaussian-assu:strength, $\boldsymbol{\Sigma} \in \mathfrak{U}(\gamma, C)$ in eq:bandable, $\max \{ \| \boldsymbol{\Sigma} \|_2, \| \boldsymbol{\Sigma}^{-1} \|_2\} \le C$ and $\sqrt{n} b \tilde{d}^{-\gamma-1} = o(1)$ for a positive constant $c_3$, then we have $\min\{\|\hat{\boldsymbol{\ell}}_{\mathrm{cfa}} + \cdots$
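For orientation, bandable covariance classes are commonly defined in the literature through a polynomial decay of the off-diagonal entries. The definition in eq:bandable is not reproduced here, so the display below is only the usual form of such a class, written in the notation of Theorem 1 under the assumption that the paper follows it:

```latex
\mathfrak{U}(\gamma, C) = \Bigl\{ \boldsymbol{\Sigma} = (\sigma_{ij}) :
  \max_{j} \sum_{i \,:\, |i-j| > k} |\sigma_{ij}| \le C k^{-\gamma}
  \ \text{for all } k > 0 \Bigr\}
```

Under this form, a larger $\gamma$ corresponds to faster decay of dependence away from the diagonal.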

Figures (4)

  • Figure 1: Phase transition of clustering in panel (a) and signal recovery in panel (b). Gray region: consistent clustering or signal recovery is impossible by any algorithm. Purple region: consistent clustering or signal recovery is impossible under the polynomial computation time constraint. Red region: consistent clustering or signal recovery can be achieved by a polynomial-time algorithm. Black dashed line: CMLBs under non-block signals ($\alpha = 0$).
  • Figure 2: Simulation results under the non-block signal settings. The clustering errors (upper panels) and signal recovery errors (lower panels) of CFA-PCA, MA-PCA, IF-PCA, spectral clustering and $k$-means under different signal strengths, averaged over 500 repeated experiments. The rows represent the dense and sparse signal scenarios and the columns represent different dimensions.
  • Figure 3: Simulation results under the block signal settings in Table \ref{tab:sim_settings}. The clustering errors (upper panels) and signal recovery errors (lower panels) of CFA-PCA, MA-PCA, IF-PCA, spectral clustering and $k$-means under different signal strengths, averaged over 500 repeated experiments. The rows represent the dense and sparse signal scenarios and the columns represent different dimensions.
  • Figure 4: The estimated cluster means of the yearly surface temperature change by $k$-means (upper two panels) and CFA-PCA (middle two panels). The shaded areas indicate the identified signal blocks. The bottom panel shows the clustering results of CFA-PCA and the annual Niño 3.4 index (black line), where the red and gray shading represent the years assigned to the two clusters.

Theorems & Definitions (8)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Conjecture 1: Low-degree polynomial conjecture
  • Theorem 5
  • Theorem 6
  • Corollary 1