Low-degree Lower bounds for clustering in moderate dimension

Alexandra Carpentier, Nicolas Verzelen

TL;DR

A new low-degree polynomial lower bound for the moderate-dimensional case when $d \geq K$ is established, and a novel non-spectral algorithm is provided matching this rate, shedding new light on the computational limits of the clustering problem in moderate dimension.

Abstract

We study the fundamental problem of clustering $n$ points into $K$ groups drawn from a mixture of isotropic Gaussians in $\mathbb{R}^d$. Specifically, we investigate the requisite minimal distance $\Delta$ between mean vectors to partially recover the underlying partition. While the minimax-optimal threshold for $\Delta$ is well-established, a significant gap exists between this information-theoretic limit and the performance of known polynomial-time procedures. Although this gap was recently characterized in the high-dimensional regime ($n \leq dK$), it remains largely unexplored in the moderate-dimensional regime ($n \geq dK$). In this manuscript, we address this regime by establishing a new low-degree polynomial lower bound for the moderate-dimensional case when $d \geq K$. We show that while the difficulty of clustering for $n \leq dK$ is primarily driven by dimension reduction and spectral methods, the moderate-dimensional regime involves more delicate phenomena leading to a "non-parametric rate". We provide a novel non-spectral algorithm matching this rate, shedding new light on the computational limits of the clustering problem in moderate dimension.
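
To make the setting concrete, the following is a minimal sketch of the data model described in the abstract, not the authors' algorithm. It samples $n$ points from a mixture of $K$ isotropic Gaussians in $\mathbb{R}^d$, computes the minimal distance between mean vectors, and reports which regime ($n \leq dK$ or $n \geq dK$) the parameters fall in. The function names, the unit noise variance, the uniformly drawn labels, and the placement of the means on scaled basis vectors are all assumptions made for illustration.

```python
import math
import random


def sample_mixture(n, K, d, delta, seed=0):
    """Draw n points from a mixture of K isotropic Gaussians in R^d.

    Assumptions (not from the paper): unit noise variance, labels drawn
    uniformly at random, and means placed at (delta / sqrt(2)) * e_k so
    that every pair of means is at distance exactly delta. Needs d >= K.
    """
    if d < K:
        raise ValueError("this illustrative layout requires d >= K")
    rng = random.Random(seed)
    scale = delta / math.sqrt(2)
    means = [[scale if j == k else 0.0 for j in range(d)] for k in range(K)]
    labels = [rng.randrange(K) for _ in range(n)]
    points = [
        [means[z][j] + rng.gauss(0.0, 1.0) for j in range(d)] for z in labels
    ]
    return points, labels, means


def min_separation(means):
    """Minimal Euclidean distance between two distinct mean vectors."""
    return min(
        math.dist(a, b) for i, a in enumerate(means) for b in means[i + 1:]
    )


def regime(n, K, d):
    """Classify (n, K, d) using the thresholds quoted in the abstract.

    The boundary n == d * K satisfies both inequalities in the abstract;
    this sketch arbitrarily reports it as moderate-dimensional.
    """
    return "moderate-dimensional" if n >= d * K else "high-dimensional"


if __name__ == "__main__":
    points, labels, means = sample_mixture(n=200, K=3, d=5, delta=2.0)
    print(len(points), min_separation(means), regime(200, 3, 5))
```

Placing the means on scaled basis vectors is only a convenient way to obtain a mixture whose minimal separation equals a chosen value; any other configuration of means with the same minimal distance would serve equally well.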

Paper Structure

This paper contains 85 sections, 36 theorems, 178 equations, and 6 figures.

Key Result

Theorem 1

There exist positive numerical constants $c_0>0$ and $c'_0>0$ such that the following holds for any $d\geq K$. Assume that $D\geq c'_0$ and that for some $c\geq c_0$. Then

Figures (6)

  • Figure 1: Template $G^*$ used to define the upper bound. It is mostly a double chain in which, every $L$ nodes, a fastener is added: two nodes joined by one edge and connected to $v_1$ and $v_2$, respectively. There are $M-1$ such fasteners.
  • Figure 2: Modification of the template $G$ in Figure 1 to accommodate $d\geq K$. Each edge of $G$ is simply replaced by a simple chain of length $N$.
  • Figure 3: Example of the construction of $G_{\Delta}$. The two multigraphs $G^{(1)}$ and $G^{(2)}$ are shown in blue and red, respectively. The node matchings are indicated by dotted lines between the nodes concerned. The half-edge pairings are indicated by coloured segments: two half-edges flagged with the same colour are paired together. The nodes $v_1^{(1)}, v_1^{(2)}$ are matched together, as are $v_2^{(1)},v_2^{(2)}$. $G_{\Delta}$ is then displayed in purple. Note that matched nodes appear only once, and that some edges are removed and some are added, corresponding to the half-edge pairings. One pair of nodes is fully matched and belongs to $\mathbf M_{\mathrm{full}}$: all half-edges connected to it are paired (in $\mathbf P$), so it is isolated in $G_{\Delta}$.
  • Figure 4: Examples of open paths. The picture on the left depicts a simple example of an open path that has an odd number of pairs of half-edges involved, while the picture on the right depicts an example of an open path that has an even number of pairs of half-edges involved. In each case, to create $G_{\Delta}$, the edges such that both half-edges are paired are removed, and we draw an edge through the open extremities.
  • Figure 5: Examples of cycles. The picture on the left depicts an example of a simple cycle generated by two pairs of half-edges. The picture on the right presents a more complicated case of a longer cycle. In each case, the cycles are removed to create $G_{\Delta}$.
  • ...and 1 more figure

Theorems & Definitions (49)

  • Conjecture 1
  • Theorem 1: Low-degree lower bound
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Theorem 3
  • Remark 1: Spectral Methods
  • Remark 2: Connection with tensors
  • ...and 39 more