Spectral clustering in the Gaussian mixture block model

Shuangping Li; Tselil Schramm

TL;DR

The paper studies clustering and embedding in graphs drawn from a high-dimensional Gaussian mixture block model (GMBM), where each node carries a latent $d$-dimensional feature from a two-component Gaussian mixture and edges appear if $\langle u_i,u_j\rangle \ge \tau$. It develops and analyzes a canonical spectral algorithm, using a sophisticated trace-method with Gegenbauer polynomials to show that the adjacency matrix $A$ can be well-approximated by a linear term $p_0\mathbf{1}\mathbf{1}^T + \tilde{d}\lambda_1UU^T$, enabling reliable latent-vector recovery up to rotation, hypothesis testing to distinguish two communities, and (near) exact clustering under appropriate regimes. The work also provides lower bounds highlighting information-theoretic limits and discusses connections to the stochastic block model and random geometric graphs, thereby outlining an initial information-computation landscape for GMBMs. Overall, the results establish provable guarantees for spectral embedding in a realistic high-dimensional geometric network model and chart directions for extending to more complex mixtures and non-spherical covariances, with potential implications for network data analysis and community detection in modern, high-dimensional settings.
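The linear approximation $p_0\mathbf{1}\mathbf{1}^T + \tilde{d}\lambda_1UU^T$ suggests a simple recipe: remove the rank-one density term, then read the community structure off the leading eigenvector of what remains. The following is a minimal generic sketch of that idea, not the paper's exact algorithm. The function name `spectral_bipartition`, the empirical density estimate `p_hat`, and the sign-based rounding are all assumptions made for illustration.

```python
import numpy as np


def spectral_bipartition(A):
    """Split nodes into two groups from a symmetric 0/1 adjacency matrix.

    Illustrative sketch only: centering by the empirical edge density and
    rounding by sign are assumptions, not the paper's stated procedure.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Empirical edge density over ordered off-diagonal pairs.
    p_hat = A.sum() / (n * (n - 1))
    # Subtract the rank-one "density" part (leaving the diagonal at zero).
    centered = A - p_hat * (np.ones((n, n)) - np.eye(n))
    # eigh returns eigenvalues in ascending order, so the last column
    # is the eigenvector of the largest eigenvalue.
    _, eigvecs = np.linalg.eigh(centered)
    top = eigvecs[:, -1]
    # Round by sign; the labels are only defined up to swapping 0 and 1.
    return (top >= 0).astype(int)


if __name__ == "__main__":
    # Toy input: two disjoint triangles, so the split is unambiguous.
    A = np.zeros((6, 6))
    A[:3, :3] = 1
    A[3:, 3:] = 1
    np.fill_diagonal(A, 0)
    print(spectral_bipartition(A))
```

Because an eigenvector is only determined up to sign, the two output labels may be swapped between runs or platforms; any comparison against ground-truth communities should account for that.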

Abstract

Gaussian mixture block models are distributions over graphs that strive to model modern networks: to generate a graph from such a model, we associate each vertex $i$ with a latent feature vector $u_i \in \mathbb{R}^d$ sampled from a mixture of Gaussians, and we add edge $(i,j)$ if and only if the feature vectors are sufficiently similar, in that $\langle u_i,u_j \rangle \ge \tau$ for a pre-specified threshold $\tau$. The different components of the Gaussian mixture represent the fact that there may be different types of nodes with different distributions over features -- for example, in a social network each component represents the different attributes of a distinct community. Natural algorithmic tasks associated with these networks are embedding (recovering the latent feature vectors) and clustering (grouping nodes by their mixture component). In this paper we initiate the study of clustering and embedding graphs sampled from high-dimensional Gaussian mixture block models, where the dimension of the latent feature vectors $d\to \infty$ as the size of the network $n \to \infty$. This high-dimensional setting is most appropriate in the context of modern networks, in which we think of the latent feature space as being high-dimensional. We analyze the performance of canonical spectral clustering and embedding algorithms for such graphs in the case of 2-component spherical Gaussian mixtures, and begin to sketch out the information-computation landscape for clustering and embedding in these models.
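The generative process described above can be sketched in a few lines. This is a hedged illustration under assumptions the abstract does not state: the two component means are taken to be $\pm\mu e_1$, the covariance is taken to be $I_d/d$, and the threshold $\tau$ is chosen empirically so that roughly a $p$ fraction of pairs become edges. The function name `sample_gmbm` and its parameters are hypothetical.

```python
import numpy as np


def sample_gmbm(n, d, mu_norm, p, seed=0):
    """Sample a graph from a two-component Gaussian mixture block model.

    Assumptions (for illustration only): means are +/- mu_norm * e_1,
    covariance is I_d / d, and tau is the empirical (1 - p)-quantile of
    the pairwise inner products.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)        # community of each node
    signs = 2 * labels - 1                     # map {0, 1} to {-1, +1}
    mean = np.zeros(d)
    mean[0] = mu_norm                          # assumed mean direction e_1
    noise = rng.normal(scale=1.0 / np.sqrt(d), size=(n, d))
    U = signs[:, None] * mean + noise          # latent vectors u_1, ..., u_n

    gram = U @ U.T                             # pairwise inner products
    iu = np.triu_indices(n, k=1)               # index pairs with i < j
    tau = np.quantile(gram[iu], 1.0 - p)       # threshold for density ~ p

    # Add edge (i, j) exactly when <u_i, u_j> >= tau, then symmetrize.
    A = np.zeros((n, n), dtype=int)
    A[iu] = gram[iu] >= tau
    A = A + A.T
    return A, U, labels, tau


if __name__ == "__main__":
    A, U, labels, tau = sample_gmbm(n=200, d=20, mu_norm=0.5, p=0.1)
    print("edge density:", A.sum() / (200 * 199))
```

Fixing $\tau$ from the sample is a convenience of this sketch; in the model itself $\tau$ is a pre-specified threshold.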

Paper Structure (27 sections, 32 theorems, 236 equations, 1 figure, 1 algorithm)

Introduction
Our results
Spectral algorithm.
Lower bounds.
Related work
Gaussian mixture block model and variations.
Recovering embeddings of random geometric graphs.
Clustering Gaussian mixtures.
Comparison to the Stochastic Block Model.
Directions for future research
Technical overview
Linear approximation of the adjacency matrix.
The trace method.
Accounting for large separation.
Hypothesis testing and clustering.
...and 12 more sections

Key Result

Theorem 1.4

Suppose that $n,d\in \mathbb Z_+$, $\mu \in \mathbb R_+$, and $p \in [0,1/2-\varepsilon]$ for any constant $\varepsilon>0$ satisfy the conditions $\log^{16} n\ll d < n$, $\mu^2 \leqslant 1/(\sqrt{d}\log n)$, and $pn\gg 1$. Then, given $G \sim \boldsymbol{G}_{n,d}(p,\mu)$ generated by latent vectors ... with high probability as $n$ goes to infinity.

Figures (1)

Figure 1: Diagram illustrating the range of $\mu$ for which we show that the spectral algorithm completes each task successfully (up to logarithmic factors), all under the condition that $1 \overset{\log}{\ll} d \overset{\log}{\ll} pn$. The solid lines correspond to our theorems. The dashed teal line indicates that beyond $d^{-1/4}$, each community corresponds to a distinct connected component in the graph and thus spectral clustering trivially succeeds. Similarly, the dashed violet line indicates that beyond $d^{-1/4}$, the community labels suffice to recover an approximate embedding. The gray $x$'s mark a range in which clustering/testing is impossible even when the latent embedding is known (lower bounds for clustering in Ndaoud22, for testing in app:kl).

Theorems & Definitions (75)

Definition 1.1: Gaussian mixture block model
Remark 1.2
Theorem 1.4: Latent vector recovery/embedding
Theorem 1.5: Hypothesis Testing
Theorem 1.6: Spectral clustering
Theorem 2.1
Proposition 2.2
Theorem 2.3
Proposition 2.4
Proposition 4.1
...and 65 more
