
Asymptotic Gaussian Fluctuations of Eigenvectors in Spectral Clustering

Hugo Lebeau, Florent Chatelain, Romain Couillet

TL;DR

The paper characterizes the fluctuations of eigenvector entries in spectral clustering under a general signal-plus-noise spike model. It proves a central limit theorem showing that, in the large-dimensional regime, the entries of the dominant eigenvectors $\hat{{\bm{v}}}_k$ of the Gram kernel matrix ${\bm{K}} = \frac{1}{p} {\bm{X}}^\top {\bm{X}}$ have Gaussian fluctuations around the scaled signal $\sqrt{\zeta_k} {\bm{v}}_k$, with variance $(1 - \zeta_k)/n$. The proof hinges on the rotational invariance of the noise and a tangent-normal decomposition, so it does not depend on the noise being Gaussian. The result enables precise predictions of classification performance and is validated on synthetic data and on real-world Fashion-MNIST experiments.
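
As an illustration of how such a Gaussian description can translate into a performance prediction, the following worked equation treats a simplified case that is an assumption of this sketch, not a formula quoted from the paper: two balanced classes with labels $j_i \in \{-1, +1\}$, a signal eigenvector ${\bm{v}} = {\bm{j}}/\sqrt{n}$, and classification by the sign of each entry of the dominant eigenvector. The symbols $j_i$, $g_i$ and $\Phi$ (the standard normal cumulative distribution function) are introduced here for illustration.

```latex
% Entrywise Gaussian model suggested by the CLT (single spike, index k dropped):
\hat{v}_i \;\approx\; \sqrt{\zeta}\,\frac{j_i}{\sqrt{n}} \;+\; \sqrt{\frac{1-\zeta}{n}}\; g_i,
\qquad g_i \sim \mathcal{N}(0, 1).

% Sample i is correctly classified when \operatorname{sign}(\hat{v}_i) = j_i, hence
\mathbb{P}\bigl(\operatorname{sign}(\hat{v}_i) = j_i\bigr)
\;\approx\; \mathbb{P}\bigl(\sqrt{\zeta} + \sqrt{1-\zeta}\; g_i > 0\bigr)
\;=\; \Phi\!\left(\sqrt{\frac{\zeta}{1-\zeta}}\right).
```

Under these assumptions the predicted accuracy depends on the data only through the alignment $\zeta$: it tends to $1$ as $\zeta \to 1$ and to $1/2$ (chance level) as $\zeta \to 0$.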

Abstract

The performance of spectral clustering relies on the fluctuations of the entries of the eigenvectors of a similarity matrix, which have been left uncharacterized until now. In this letter, it is shown that the signal $+$ noise structure of a general spike random matrix model is transferred to the eigenvectors of the corresponding Gram kernel matrix and the fluctuations of their entries are Gaussian in the large-dimensional regime. This CLT-like result was the last missing piece to precisely predict the classification performance of spectral clustering. The proposed proof is very general and relies solely on the rotational invariance of the noise. Numerical experiments on synthetic and real data illustrate the universality of this phenomenon.

Paper Structure

This paper contains 9 sections, 4 theorems, 10 equations, 3 figures.

Key Result

Theorem 1

Let $(\lambda_k, \hat{{\bm{v}}}_k)_{k \in [K]}$ denote the dominant eigenvalue-eigenvector pairs of ${\bm{K}}$ such that $\lambda_1 \geqslant \ldots \geqslant \lambda_K$. Then, for all $k \in [K]$,

Figures (3)

  • Figure 1: Empirical Spectral Distribution (ESD) of ${\bm{K}} = \frac{1}{p} {\bm{X}}^\top {\bm{X}}$ and Marčenko-Pastur Distribution (MP). The green dashed lines are the positions $\xi_k$ of isolated eigenvalues predicted by Theorem 1. Experimental setting: $n = 1000$, $p = 2000$, $K = 3$, $(n_1, n_2, n_3) = (333, 334, 333)$, $(\lVert {\bm{\mu}}_1 \rVert, \lVert {\bm{\mu}}_2 \rVert, \lVert {\bm{\mu}}_3 \rVert) = (3, 4, 5)$.
  • Figure 2: Dominant eigenvectors of ${\bm{K}} = \frac{1}{p} {\bm{X}}^\top {\bm{X}}$. Top: Coordinates of $\hat{{\bm{v}}}_k$ (blue) and the underlying signal $\sqrt{\zeta_k} {\bm{v}}_k$ (orange) with $\zeta_k$ given in Theorem 1. The dotted orange lines are the $\pm 1 \sigma$-error curves deduced from Theorem 2. Bottom: Histogram of the entries of $\hat{{\bm{v}}}_k - \sqrt{\zeta_k} {\bm{v}}_k$ (blue) and probability density function of ${\mathcal{N}}(0, \frac{1 - \zeta_k}{n})$ (orange). Experimental setting: same as in Figure 1.
  • Figure 3: Observed (upper right, blue) and predicted (lower left, orange) classification accuracies of binary spectral clustering on the Fashion-MNIST dataset (Xiao et al., 2017).
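
The comparison shown in Figure 2 can be approximated numerically. The sketch below is a minimal illustration under stated assumptions, not the authors' experiment: it uses a two-class toy model with means $\pm{\bm{\mu}}$ and i.i.d. standard Gaussian noise instead of the three-class setting of Figure 1, and, because the closed-form $\zeta_k$ of Theorem 1 is not reproduced here, it estimates the alignment empirically as $\langle \hat{{\bm{v}}}, {\bm{v}} \rangle^2$. The function name `simulate_eigvec_fluctuations` and the parameter `mu_norm` are hypothetical.

```python
import numpy as np


def simulate_eigvec_fluctuations(n=1000, p=2000, mu_norm=2.0, seed=0):
    """Toy two-class spike model (an assumption, simpler than Figure 1's K = 3)."""
    rng = np.random.default_rng(seed)

    # Balanced labels j_i in {+1, -1}; unit-norm signal direction v = j / sqrt(n).
    j = np.where(np.arange(n) < n // 2, 1.0, -1.0)
    v = j / np.sqrt(n)

    # Columns of X are samples: class mean (+mu or -mu) plus N(0, 1) noise.
    mu = np.zeros(p)
    mu[0] = mu_norm
    X = np.outer(mu, j) + rng.standard_normal((p, n))

    # Gram kernel matrix K = X^T X / p, of size n x n.
    K = X.T @ X / p

    # eigh returns eigenvalues in ascending order, so the last column
    # is the dominant eigenvector.
    _, eigvecs = np.linalg.eigh(K)
    v_hat = eigvecs[:, -1]
    if v_hat @ v < 0:  # eigenvectors are defined up to sign
        v_hat = -v_hat

    # Empirical alignment, used here in place of the theoretical zeta.
    zeta_hat = float((v_hat @ v) ** 2)

    # Residual around the scaled signal and the predicted standard deviation.
    residual = v_hat - np.sqrt(zeta_hat) * v
    sigma = np.sqrt((1.0 - zeta_hat) / n)

    return {
        "zeta_hat": zeta_hat,
        "within_1_sigma": float(np.mean(np.abs(residual) <= sigma)),
        "within_2_sigma": float(np.mean(np.abs(residual) <= 2.0 * sigma)),
    }


result = simulate_eigvec_fluctuations()
print(result)
```

If the entries of the residual are close to ${\mathcal{N}}(0, \frac{1 - \zeta}{n})$, the two coverage fractions should be near the Gaussian values of about 0.68 and 0.95. With the empirical alignment, the residual's mean square equals $(1 - \hat{\zeta})/n$ by construction, so the informative part of this check is the shape of the distribution rather than its variance.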

Theorems & Definitions (5)

  • Theorem 1: Spikes
  • Theorem 2
  • Lemma 1
  • proof
  • Theorem 3 (Schoenberg, 1938; Steerneman and van Perlo-ten Kleij, 2005)