Estimating the number of clusters of a Block Markov Chain

Thomas van Vuren, Thomas Cronk, Jaron Sanders

TL;DR

This work develops a principled method to automatically estimate the number of clusters $K$ in trajectories of Block Markov Chains by combining a trimmed, count-matrix-based spectral embedding with density-based clustering. The two-stage approach first uses singular value thresholding to obtain a low-rank embedding and a preliminary estimate $\hat{K}^{\textnormal{spec}}$, then refines this estimate via density-based clustering of the embedding, with an optional K-means completion. The authors prove asymptotic consistency: in Step 1, $\hat{K}^{\textnormal{spec}}$ consistently recovers $\textnormal{rank}(p)$, and the density-based Step 2 recovers $K$ when the information quantity $I(\alpha, p)$ is positive and the path length satisfies $\ell_n = \omega(n)$; a misclassification bound is provided for the final clustering. Extensive numerical experiments demonstrate robustness and provide insights into embedding dimension choices, path lengths, and comparisons with alternative methods, highlighting both the method's strengths and its practical limitations in finite-sample regimes.
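To make the objects in this summary concrete, the following minimal sketch simulates a trajectory and builds the transition count matrix. It assumes the transition kernel $P_{x,y} = p_{\sigma(x),\sigma(y)} / |V_{\sigma(y)}|$, where $\sigma(x)$ is the cluster of state $x$; this kernel, the uniform initial cluster, and the names `simulate_bmc_path` and `count_matrix` are assumptions made for illustration and are not taken from the paper. The parameters are those of scenario (a) in Figure 2.

```python
import numpy as np

def simulate_bmc_path(n, alpha, p, ell, rng):
    """Simulate ell transitions of a Block Markov Chain on n states.

    Assumed kernel: move to cluster j with probability p[i, j], then pick a
    state uniformly at random inside cluster j.
    """
    alpha = np.asarray(alpha, dtype=float)
    p = np.asarray(p, dtype=float)
    K = len(alpha)
    # Cluster sizes proportional to alpha; the last cluster absorbs rounding.
    sizes = np.floor(alpha * n).astype(int)
    sizes[-1] = n - sizes[:-1].sum()
    labels = np.repeat(np.arange(K), sizes)          # cluster of each state
    first_state = np.concatenate(([0], np.cumsum(sizes)[:-1]))
    # Cluster-level chain (initial cluster chosen uniformly: an assumption).
    clusters = np.empty(ell + 1, dtype=int)
    clusters[0] = rng.integers(K)
    for t in range(ell):
        clusters[t + 1] = rng.choice(K, p=p[clusters[t]])
    # Uniform state within each visited cluster.
    path = first_state[clusters] + rng.integers(0, sizes[clusters])
    return path, labels

def count_matrix(path, n):
    """N[x, y] = number of observed transitions from state x to state y."""
    N = np.zeros((n, n), dtype=np.int64)
    np.add.at(N, (path[:-1], path[1:]), 1)
    return N

rng = np.random.default_rng(0)
n, ell = 40, 5000
alpha = [0.5, 0.5]
p = [[0.92, 0.08], [0.12, 0.88]]
path, labels = simulate_bmc_path(n, alpha, p, ell, rng)
N = count_matrix(path, n)
print(N.shape, N.sum())
```

Every transition in the path contributes exactly one count, so the entries of `N` sum to `ell`.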

Abstract

Clustering algorithms frequently require the number of clusters to be chosen in advance, but it is usually not clear how to do this. To tackle this challenge when clustering within sequential data, we present a method for estimating the number of clusters when the data is a trajectory of a Block Markov Chain. Block Markov Chains are Markov Chains that exhibit a block structure in their transition matrix. The method considers a matrix that counts the number of transitions between different states within the trajectory, and transforms this into a spectral embedding whose dimension is set via singular value thresholding. The number of clusters is subsequently estimated via density-based clustering of this spectral embedding, an approach inspired by literature on the Stochastic Block Model. By leveraging and augmenting recent results on the spectral concentration of random matrices with Markovian dependence, we show that the method is asymptotically consistent, in spite of the dependencies between the count matrix's entries and even when the count matrix is sparse. We also present a numerical evaluation of our method, and compare it to alternatives.
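The idea of estimating the number of clusters from the density of embedded points can be illustrated with a generic greedy procedure. The sketch below is not the paper's algorithm: the function `count_dense_clusters`, the `radius` and `min_size` parameters, and the synthetic two-dimensional "embedding" are all hypothetical choices made for illustration. It repeatedly selects the point whose neighborhood contains the most unassigned points and stops once no neighborhood is dense enough.

```python
import numpy as np

def count_dense_clusters(points, radius, min_size):
    """Greedy density-based cluster count (illustrative, not the paper's method).

    Repeatedly pick the point whose radius-ball holds the most unassigned
    points; stop when that ball holds fewer than min_size points.
    """
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    close = (diff ** 2).sum(axis=-1) <= radius ** 2   # pairwise neighborhoods
    unassigned = np.ones(len(points), dtype=bool)
    centers = []
    while unassigned.any():
        ball_sizes = (close & unassigned[None, :]).sum(axis=1)
        c = int(np.argmax(ball_sizes))
        if ball_sizes[c] < min_size:
            break                                     # only sparse leftovers remain
        centers.append(c)
        unassigned &= ~close[c]                       # assign the whole ball
    return len(centers), centers

# Synthetic stand-in for a spectral embedding: three tight, well-separated blobs.
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
embedding = np.vstack([c + 0.1 * rng.standard_normal((50, 2)) for c in blob_centers])

k_hat, centers = count_dense_clusters(embedding, radius=1.5, min_size=10)
print(k_hat)
```

Counting dense neighborhoods rather than fixing the number of clusters in advance is what lets such a step refine a preliminary estimate; the radius and density thresholds control how aggressively small groups are discarded.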

Paper Structure (93 sections, 15 theorems, 92 equations, 13 figures, 6 tables, 7 algorithms)

This paper contains 93 sections, 15 theorems, 92 equations, 13 figures, 6 tables, 7 algorithms.

Key Result

Proposition 3.1

Presume that asm:assumption1 and asm:assumption2 hold, and that $\ell_n = \omega(n)$. If $\omega(\sqrt{\ell_n/n}) = \gamma_n = o(\ell_n/n)$, then the output of alg:kpre, $\hat{K}^{\textnormal{spec}}$, equals $\textnormal{rank}(p)$ with high probability as $n \rightarrow \infty$.
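The thresholding step in this proposition can be sketched as counting the singular values of the count matrix that exceed $\gamma_n$. The sketch below is a simplified illustration: it omits the trimming step, uses the threshold $\gamma_n = (\ell_n/n)^{3/4}$ that appears in the caption of Figure 3, and is applied to a noise-free expected count matrix for scenario (a) of Figure 2, built under the assumed kernel $P_{x,y} = p_{\sigma(x),\sigma(y)} / |V_{\sigma(y)}|$ and stationarity. The function name `estimate_rank_by_thresholding` is hypothetical.

```python
import numpy as np

def estimate_rank_by_thresholding(N, ell, exponent=0.75):
    """Count singular values of N above gamma = (ell / n) ** exponent.

    Trimming of heavily visited states is omitted in this sketch.
    """
    n = N.shape[0]
    gamma = (ell / n) ** exponent
    singular_values = np.linalg.svd(np.asarray(N, dtype=float), compute_uv=False)
    return int((singular_values > gamma).sum()), gamma

# Noise-free expected counts for scenario (a): two clusters of equal size.
n, ell = 200, 200_000
p = np.array([[0.92, 0.08], [0.12, 0.88]])
pi = np.array([0.6, 0.4])            # stationary distribution of p (pi @ p == pi)
size = n // 2
# Expected count per state pair in block (i, j): ell * pi_i * p_ij / size^2.
block_means = ell * pi[:, None] * p / size ** 2
expected_N = np.kron(block_means, np.ones((size, size)))

k_spec, gamma = estimate_rank_by_thresholding(expected_N, ell)
print(k_spec, round(gamma, 1))       # prints: 2 177.8
```

Here the two nonzero singular values of the expected matrix sit well above the threshold and all remaining ones are numerically zero, so the count equals $\textnormal{rank}(p) = 2$. With an observed count matrix, the remaining singular values are no longer zero, and the threshold has to separate them from the leading ones.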

Figures (13)

  • Figure 1: Consider the trajectory $X_0, \ldots, X_{\ell_n}$, depicted using thin black arrows, of some Block Markov Chain (BMC). Are there $K=2$ or perhaps $K=3$ clusters, as depicted on the left or right? Here, the dark gray circles represent states, the larger light gray circles represent clusters, and the thick arrows represent the (hidden) low-dimensional transition probabilities between clusters.
  • Figure 2: Scatter plots of the estimated number of clusters as a function of size at three different path lengths, and in four different scenarios ranging from easy to more difficult. Each $95\%$-confidence interval was calculated using $24$ independent replications. The parameters of these BMCs are as follows: (a) $\alpha = (0.5; 0.5)$ and $p = ( 0.92, 0.08; 0.12, 0.88 )$, (b) $\alpha = (0.3; 0.3; 0.4)$ and $p = ( 0.05, 0.10, 0.85; 0.40, 0.50, 0.10; 0.05, 0.90, 0.05 )$, (c) $\alpha = (0.2; 0.3; 0.5)$ and $p = ( 0.20, 0.35, 0.45; 0.40, 0.50, 0.10; 0.05, 0.60, 0.35 )$, (d) $K = 10$, $\alpha_k = 1/K$, and a uniformly at random distributed transition matrix; see \ref{['sec:uniform_BMCs']}.
  • Figure 3: (a) Histograms of the relative accuracy of \ref{['alg:kpre', 'alg:kpost']} for BMCs as in \ref{['ex:dot_product_model']} with $K=10$, $d=5$, and $v_i\in\mathbb{R}^d$ and $\alpha$ sampled as described in \ref{['sec:low_rank_BMCs']}. Here, $n=1000$, and the path length $\ell_n=n(\ln n)^{\beta}$ is varied with $\beta= 5.0, 3.5, 2.0$ from top to bottom. Each histogram is the result of $500$ independent repetitions. (b) Histograms of the empirical singular value distribution of $\hat{N} / \gamma_n$ where $\gamma_n = (\ell_n/n)^{3/4}$ for some random BMC from \ref{['fig:Low_rank_example_relative_accuracy']} while varying the path length as in (a). The red dots indicate the location of the five largest singular values. The solid red curve represents the theoretical prediction for the limiting distribution from vanwerde2023matrix. The dashed vertical line indicates the location of the threshold $\gamma_n$, which remains fixed due to the rescaling.
  • Figure 4: Histograms of relative accuracy when $r \in \{ 5, 10, 15 \}$ and $\ell_n = n (\ln n)^{\beta}$ with $\beta \in \{ 2, 3, 4 \}$, for uniformly sampled and reversible BMCs; see \ref{['sec:uniform_BMCs', 'sec:reversible_BMCs']}. Here, $n=1000$, $K=10$, and each histogram is the result of $500$ independent replications.
  • Figure 5: Scatter plots of pairs $(\beta,p_0)$ for which each algorithm correctly estimates $K$ at least half of the time. From (a) to (c), each data point is obtained using $24$, $24$, and $384$ independent replications, respectively. Arrows in (c) indicate regions where \ref{['alg:kpre']} outputs values greater/less than 2 at least half of the time.
  • ...and 8 more figures
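The path-length regimes quoted in the captions of Figures 3 and 4 can be put into numbers with a short calculation. The sketch below evaluates $\ell_n = n(\ln n)^{\beta}$ and $\gamma_n = (\ell_n/n)^{3/4}$ for $n = 1000$ and the values of $\beta$ from Figure 3, together with the average number of observed transitions per entry of the count matrix, $\ell_n / n^2$. The function name `regime` and the dictionary keys are hypothetical.

```python
import math

def regime(n, beta, exponent=0.75):
    """Scales associated with path length ell = n * (ln n) ** beta."""
    ell = n * math.log(n) ** beta
    ratio = ell / n
    return {
        "ell": ell,
        "mean_count_per_entry": ell / n ** 2,  # average transitions per matrix entry
        "lower_scale": math.sqrt(ratio),       # sqrt(ell / n)
        "gamma": ratio ** exponent,            # threshold (ell / n) ** (3/4)
        "upper_scale": ratio,                  # ell / n
    }

n = 1000
table = {beta: regime(n, beta) for beta in (5.0, 3.5, 2.0)}
for beta, r in table.items():
    print(
        f"beta={beta}: ell={r['ell']:.3g}, "
        f"mean count/entry={r['mean_count_per_entry']:.3g}, "
        f"sqrt(ell/n)={r['lower_scale']:.3g} < gamma={r['gamma']:.3g} "
        f"< ell/n={r['upper_scale']:.3g}"
    )
```

For any $\ell_n / n > 1$, the threshold $(\ell_n/n)^{3/4}$ lies strictly between $\sqrt{\ell_n/n}$ and $\ell_n/n$, in line with the condition of Proposition 3.1. For $\beta = 2$ the average count per entry falls below one, which corresponds to the sparse count matrices mentioned in the abstract.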

Theorems & Definitions (26)

  • Proposition 3.1
  • Example 3.2: Dot-product model (10.5555/1777879.1777890, athreya2018statistical)
  • Theorem 3.3
  • Theorem 3.4: Adapted from ClusterBMC2017
  • Theorem A.1: Adapted from Bounds2023
  • Corollary A.2: Adapted from Bounds2023
  • Lemma A.3
  • proof
  • Definition A.4: Discrepancy property
  • Proposition A.5
  • ...and 16 more