Robust spectral clustering with rank statistics

Joshua Cape, Xianshi Yu, Jonquil Z. Liao

TL;DR

This work develops a robust spectral clustering framework based on an entrywise rank transform of the data matrix, enabling reliable latent block recovery under heavy-tailed and heterogeneous-variance conditions. By passing to ranks to obtain $\widetilde{R}_A$, the leading eigenspace of the spectral embedding consistently estimates the population block structure and, under suitable conditions, yields asymptotic normality for embedded nodes. The theory covers weak consistency, node-specific strong consistency, and distributional limits, all without relying on moment conditions, and is complemented by numerical examples and a connectome application showing improved clustering and parsimonious embeddings. The approach provides practical, parameter-free robustness in spectral clustering with potential extensions to broader data geometries beyond strict block models.

Abstract

This paper analyzes the statistical performance of a robust spectral clustering method for latent structure recovery in noisy data matrices. We consider eigenvector-based clustering applied to a matrix of nonparametric rank statistics that is derived entrywise from the raw, original data matrix. This approach is robust in the sense that, unlike traditional spectral clustering procedures, it can provably recover population-level latent block structure even when the observed data matrix includes heavy-tailed entries and has a heterogeneous variance profile. Our main theoretical contributions are threefold and hold under flexible data generating conditions. First, we establish that robust spectral clustering with rank statistics can consistently recover latent block structure, viewed as communities of nodes in a graph, in the sense that unobserved community memberships for all but a vanishing fraction of nodes are correctly recovered with high probability when the data matrix is large. Second, we refine the former result and further establish that, under certain conditions, the community membership of any individual, specified node of interest can be asymptotically exactly recovered with probability tending to one in the large-data limit. Third, we establish asymptotic normality results associated with the truncated eigenstructure of matrices whose entries are rank statistics, made possible by synthesizing contemporary entrywise matrix perturbation analysis with the classical nonparametric theory of so-called simple linear rank statistics. Collectively, these results demonstrate the statistical utility of rank-based data transformations when paired with spectral techniques for dimensionality reduction. Additionally, for a dataset of human connectomes, our approach yields parsimonious dimensionality reduction and improved recovery of ground-truth neuroanatomical cluster structure.
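
The pipeline described in the abstract (entrywise ranks, then a truncated eigendecomposition, then clustering) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' algorithm: the normalization `rank / (N + 1)`, the scaling of eigenvectors by the square root of the absolute eigenvalues, the plain Lloyd-style `kmeans` with farthest-point initialization, and every function name and parameter (`normalized_rank_matrix`, `spectral_embedding`, `demo`, `shift`) are assumptions introduced here. The demo draws a symmetric two-block matrix with Cauchy noise, which has no finite moments.

```python
import numpy as np

def normalized_rank_matrix(A):
    """Entrywise rank transform of a symmetric matrix.

    Assumed normalization (not taken from the paper): rank / (N + 1),
    where N is the number of entries on or above the diagonal.
    """
    n = A.shape[0]
    iu = np.triu_indices(n)                     # upper triangle, diagonal included
    vals = A[iu]
    ranks = np.empty(vals.size)
    # Ranks 1..N; ties (absent for continuous data) are broken by position.
    ranks[np.argsort(vals, kind="stable")] = np.arange(1, vals.size + 1)
    R = np.zeros((n, n))
    R[iu] = ranks / (vals.size + 1)
    return R + np.triu(R, 1).T                  # mirror to the lower triangle

def spectral_embedding(M, d):
    """Rows are d-dimensional embeddings from the d largest-magnitude eigenpairs."""
    evals, evecs = np.linalg.eigh(M)
    top = np.argsort(np.abs(evals))[::-1][:d]
    return evecs[:, top] * np.sqrt(np.abs(evals[top]))

def kmeans(X, k, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def demo(n=300, shift=4.0, seed=0):
    """Two equal-size blocks, Cauchy noise; returns clustering accuracy in [0.5, 1]."""
    rng = np.random.default_rng(seed)
    z = np.repeat([0, 1], n // 2)               # hypothetical ground-truth labels
    mean = np.where(z[:, None] == z[None, :], shift, 0.0)
    A = mean + rng.standard_cauchy((n, n))
    A = np.triu(A) + np.triu(A, 1).T            # symmetrize using the upper triangle
    labels = kmeans(spectral_embedding(normalized_rank_matrix(A), 2), 2)
    agree = np.mean(labels == z)
    return max(agree, 1 - agree)                # accuracy up to label swap

if __name__ == "__main__":
    print("rank-based clustering accuracy:", demo())
```

Replacing `normalized_rank_matrix(A)` with `A` inside `demo` gives the non-robust baseline for comparison. Because the rank transform bounds every entry in (0, 1), a single extreme Cauchy draw cannot dominate the leading eigenvectors, which is the intuition behind the robustness claim.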

Paper Structure

This paper contains 46 sections, 27 theorems, 231 equations, 8 figures, 1 table, 3 algorithms.

Key Result

Lemma 3

Let $A$ be a symmetric $n \times n$ random matrix with i.i.d. entries $A_{ij} \sim F$ for $i \leq j$, where $F$ is an absolutely continuous cumulative distribution function. Let $\widetilde{R}_{A}$ denote the matrix of normalized rank statistics, per Algorithm \ref{alg:ptr_baseline}. If $n \geq \dots$

Figures (8)

  • Figure 1: Simulation example in \ref{sec:example_illustration} showing the truncated eigenstructure of $A$ (left column) and $\widetilde{R}_{A}$ (right column). Blue lines denote the population-level ground truth eigenvector behavior. Gray points are derived from the simulated data. Several outlier values outside the interval $[-2,2]$ are not shown. The two-dimensional eigentruncation of the normalized ranks matrix recovers the unobserved two-block ground truth structure.
  • Figure 2: Simulation example in \ref{sec:numerical_pareto}. Row and column indices indicate corresponding node labels. Nodes are ordered to illustrate approximate block structure. Brown dots on the left represent values of the original data below the $1\%$ percentile or above the $99\%$ percentile; these values corrupt the raw data embedding in \ref{fig:pareto_embed}. Here, the matrix of normalized ranks reflects $K=2$ and satisfies the conditions for consistency in \ref{sec:main_results}.
  • Figure 3: Four-dimensional embedding of the raw data (Panel A) and after passing to ranks (Panel B) in \ref{sec:numerical_pareto}. Coordinates correspond to the four leading eigenvectors after scaling. The underlying population-level true dimension equals two. Colors and point shapes both denote unobserved block memberships. Perfect clustering is apparent in the first column of Panel B displaying the robust rank-based embedding.
  • Figure 4: Shown are overlaid embedding point clouds derived from $A$ and $\widetilde{R}_{A^{\prime}}$ in \ref{sec:numerical_overlaid_pointclouds}, where $A^{\prime}$ is corrupted by Cauchy entries. Coordinates correspond to the three leading eigenvectors after rotation and scaling. The above-diagonal panels show correlations for cluster-specific embeddings, asterisks indicate their level of statistical significance under the null hypothesis of zero correlation, and cluster labels are denoted by '1', '2', '3'. Using normalized rank statistics preserves properties of the original embedding.
  • Figure 5: The ratio of asymptotic variances per \ref{eq:cluster_ARE_ratio} is plotted as a function of underlying model parameters $(\mu, \gamma)$ and $(\mu, \sigma)$, respectively. Ratio values larger than one (i.e., above the dashed black line) indicate asymptotic, theoretical grounds on which to favor robust spectral clustering, and vice versa. Plotted curves are obtained by the numerical evaluation of analytically derived functions, not by simulation. See \ref{sec:numerical_asymp_covariance} for details.
  • ...and 3 more figures

Theorems & Definitions (35)

  • Definition 1: Data matrices with blockmodel structure
  • Remark 2: Population-level block structure
  • Lemma 3: Single block expected trace bound
  • Lemma 4: Multiple block expected trace bound
  • Theorem 5: Weak consistency of robust spectral clustering with approximate $k$-means clustering
  • Corollary 6: Weak consistency of robust spectral clustering via relative error bounds
  • Theorem 7: Top eigenspace relative error of robust embedding
  • Theorem 8: Node-specific strong consistency of robust spectral clustering, general case
  • Corollary 9: Node-specific strong consistency of robust spectral clustering, special case
  • Remark 10: Population spectral gap and blockmodel distributions
  • ...and 25 more