Robust spectral clustering with rank statistics

Joshua Cape; Xianshi Yu; Jonquil Z. Liao

TL;DR

This work develops a robust spectral clustering framework based on an entrywise rank transform of the data matrix, enabling reliable recovery of latent block structure when entries are heavy-tailed and have heterogeneous variances. After passing to ranks to obtain $\widetilde{R}_A$, the leading eigenspace of the rank matrix consistently estimates the population block structure and, under suitable conditions, the embedded nodes are asymptotically normal. The theory covers weak consistency, node-specific strong consistency, and distributional limits, all without relying on moment conditions, and is complemented by numerical examples and a connectome application showing improved clustering and parsimonious embeddings. The approach provides practical, parameter-free robustness in spectral clustering, with potential extensions to broader data geometries beyond strict block models.

Abstract

This paper analyzes the statistical performance of a robust spectral clustering method for latent structure recovery in noisy data matrices. We consider eigenvector-based clustering applied to a matrix of nonparametric rank statistics that is derived entrywise from the raw, original data matrix. This approach is robust in the sense that, unlike traditional spectral clustering procedures, it can provably recover population-level latent block structure even when the observed data matrix includes heavy-tailed entries and has a heterogeneous variance profile. Our main theoretical contributions are threefold and hold under flexible data generating conditions. First, we establish that robust spectral clustering with rank statistics can consistently recover latent block structure, viewed as communities of nodes in a graph, in the sense that unobserved community memberships for all but a vanishing fraction of nodes are correctly recovered with high probability when the data matrix is large. Second, we refine the former result and further establish that, under certain conditions, the community membership of any individual, specified node of interest can be asymptotically exactly recovered with probability tending to one in the large-data limit. Third, we establish asymptotic normality results associated with the truncated eigenstructure of matrices whose entries are rank statistics, made possible by synthesizing contemporary entrywise matrix perturbation analysis with the classical nonparametric theory of so-called simple linear rank statistics. Collectively, these results demonstrate the statistical utility of rank-based data transformations when paired with spectral techniques for dimensionality reduction. Additionally, for a dataset of human connectomes, our approach yields parsimonious dimensionality reduction and improved recovery of ground-truth neuroanatomical cluster structure.
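The pipeline described in the abstract (rank the entries, take leading eigenvectors, cluster the embedded rows) can be illustrated with a minimal Python sketch. It is not the paper's algorithm verbatim. The function names, the zero diagonal, the normalization of ranks by $N + 1$ with $N = n(n-1)/2$, the eigenvalue scaling of the embedding, and the simple Lloyd-style k-means with farthest-point initialization are all assumptions made for illustration. The simulated two-block Cauchy data are likewise hypothetical.

```python
import numpy as np


def normalized_rank_matrix(A):
    """Entrywise normalized ranks of the strictly upper triangle of symmetric A.

    Assumptions (not taken from the paper): the diagonal is set to zero and
    ranks are divided by N + 1, where N = n(n-1)/2. Ties are assumed absent,
    as is the case almost surely for continuous data.
    """
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)
    values = A[iu]
    ranks = np.argsort(np.argsort(values)) + 1  # ranks 1..N when there are no ties
    R = np.zeros((n, n))
    R[iu] = ranks / (values.size + 1.0)
    return R + R.T


def lloyd_kmeans(X, K, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, K):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(dist, axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels


def rank_spectral_clustering(A, d, K):
    """Embed nodes with the d largest-magnitude eigenpairs of the rank matrix, then cluster."""
    R = normalized_rank_matrix(A)
    vals, vecs = np.linalg.eigh(R)
    top = np.argsort(np.abs(vals))[::-1][:d]
    X = vecs[:, top] * np.sqrt(np.abs(vals[top]))  # scaled embedding (an assumed convention)
    return lloyd_kmeans(X, K), X


def simulate_two_block_cauchy(n, seed=0):
    """Hypothetical example: two equal blocks, Cauchy noise, locations +1 within and -1 between."""
    rng = np.random.default_rng(seed)
    z = np.repeat([0, 1], n // 2)
    loc = np.where(z[:, None] == z[None, :], 1.0, -1.0)
    A = np.triu(loc + rng.standard_cauchy((n, n)), k=1)
    return A + A.T, z


if __name__ == "__main__":
    A, z = simulate_two_block_cauchy(200, seed=0)
    labels, X = rank_spectral_clustering(A, d=2, K=2)
    accuracy = max(np.mean(labels == z), np.mean(labels != z))  # account for label switching
    print(f"fraction of nodes correctly clustered: {accuracy:.3f}")
```

Because the Cauchy distribution has no finite mean, the raw matrix $A$ violates the moment conditions that standard spectral analyses rely on, whereas the entries of the rank matrix are bounded in $(0, 1)$ by construction. The sketch needs no tuning parameter beyond the embedding dimension `d` and the number of clusters `K`.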

Paper Structure (46 sections, 27 theorems, 231 equations, 8 figures, 1 table, 3 algorithms)

Introduction and background
Overview and contributions of this paper
Notation
Robust spectral embedding and clustering
Matrix-valued data with latent block structure
Matrices of normalized rank statistics
Eigenvector preliminaries for block-structured matrices
An example with contaminated normal distributions
Statistical considerations for clustering
Main results
Bounding matrices of normalized rank statistics
Weak consistency
Node-specific strong consistency
Asymptotic normality
Selecting the embedding dimension
...and 31 more sections

Key Result

Lemma 3

Let $A$ be a symmetric $n \times n$ random matrix with i.i.d. entries $A_{ij} \sim F$ for $i \leq j$, where $F$ is an absolutely continuous cumulative distribution function. Let $\widetilde{R}_{A}$ denote the matrix of normalized rank statistics, per \ref{alg:ptr_baseline}. If $n \geq

Figures (8)

Figure 1: Simulation example in \ref{sec:example_illustration} showing the truncated eigenstructure of $A$ (left column) and $\widetilde{R}_{A}$ (right column). Blue lines denote the population-level ground truth eigenvector behavior. Gray points are derived from the simulated data. Several outlier values outside the interval $[-2,2]$ are not shown. The two-dimensional eigentruncation of the normalized ranks matrix recovers the unobserved two-block ground truth structure.
Figure 2: Simulation example in \ref{sec:numerical_pareto}. Row and column indices indicate corresponding node labels. Nodes are ordered to illustrate approximate block structure. Brown dots on the left represent values of the original data below the $1\%$ percentile or above the $99\%$ percentile; these values corrupt the raw data embedding in \ref{fig:pareto_embed}. Here, the matrix of normalized ranks reflects $K=2$ and satisfies the conditions for consistency in \ref{sec:main_results}.
Figure 3: Four-dimensional embedding of the raw data (Panel A) and after passing to ranks (Panel B) in \ref{sec:numerical_pareto}. Coordinates correspond to the four leading eigenvectors after scaling. The underlying population-level true dimension equals two. Colors and point shapes both denote unobserved block memberships. Perfect clustering is apparent in the first column of Panel B displaying the robust rank-based embedding.
Figure 4: Overlaid embedding point clouds derived from $A$ and $\widetilde{R}_{A^{\prime}}$ in \ref{sec:numerical_overlaid_pointclouds}, where $A^{\prime}$ is corrupted by Cauchy entries. Coordinates correspond to the three leading eigenvectors after rotation and scaling. The above-diagonal panels show correlations for cluster-specific embeddings, asterisks indicate their level of statistical significance under the null hypothesis of zero correlation, and cluster labels are denoted by '1', '2', '3'. Using normalized rank statistics preserves properties of the original embedding.
Figure 5: The ratio of asymptotic variances per \ref{eq:cluster_ARE_ratio} is plotted as a function of underlying model parameters $(\mu, \gamma)$ and $(\mu, \sigma)$, respectively. Ratio values larger than one (i.e., above the dashed black line) indicate asymptotic, theoretical grounds on which to favor robust spectral clustering, and vice versa. Plotted curves are obtained by the numerical evaluation of analytically derived functions, not by simulation. See \ref{sec:numerical_asymp_covariance} for details.
...and 3 more figures

Theorems & Definitions (35)

Definition 1: Data matrices with blockmodel structure
Remark 2: Population-level block structure
Lemma 3: Single block expected trace bound
Lemma 4: Multiple block expected trace bound
Theorem 5: Weak consistency of robust spectral clustering with approximate $k$-means clustering
Corollary 6: Weak consistency of robust spectral clustering via relative error bounds
Theorem 7: Top eigenspace relative error of robust embedding
Theorem 8: Node-specific strong consistency of robust spectral clustering, general case
Corollary 9: Node-specific strong consistency of robust spectral clustering, special case
Remark 10: Population spectral gap and blockmodel distributions
...and 25 more
