Estimating the number of clusters of a Block Markov Chain

Thomas van Vuren; Thomas Cronk; Jaron Sanders

TL;DR

This work develops a principled method for automatically estimating the number of clusters $K$ from a trajectory of a Block Markov Chain (BMC), combining a spectral embedding of a trimmed count matrix with density-based clustering. The approach has two stages. Step 1 applies singular value thresholding to obtain a low-rank embedding and a preliminary estimate $\hat{K}^{\textnormal{spec}}$. Step 2 refines this estimate via density-based clustering of the embedding, with an optional K-means completion. The authors prove asymptotic consistency: in Step 1, $\hat{K}^{\textnormal{spec}}$ equals $\textnormal{rank}(p)$ with high probability, and the density-based Step 2 recovers $K$ when the information quantity $I(\alpha,p)$ is positive and the path length satisfies $\ell_n = \omega(n)$. A misclassification bound is provided for the final clustering. Numerical experiments examine robustness, the choice of embedding dimension, the effect of path length, and comparisons with alternative methods, highlighting both the method's strengths and its practical limitations in finite-sample regimes.

Abstract

Clustering algorithms frequently require the number of clusters to be chosen in advance, but it is usually not clear how to do this. To tackle this challenge when clustering within sequential data, we present a method for estimating the number of clusters when the data is a trajectory of a Block Markov Chain. Block Markov Chains are Markov Chains that exhibit a block structure in their transition matrix. The method considers a matrix that counts the number of transitions between different states within the trajectory, and transforms this into a spectral embedding whose dimension is set via singular value thresholding. The number of clusters is subsequently estimated via density-based clustering of this spectral embedding, an approach inspired by literature on the Stochastic Block Model. By leveraging and augmenting recent results on the spectral concentration of random matrices with Markovian dependence, we show that the method is asymptotically consistent, in spite of the dependencies between the count matrix's entries, and even when the count matrix is sparse. We also present a numerical evaluation of our method, and compare it to alternatives.
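To make the count matrix concrete, the following minimal Python sketch simulates a BMC trajectory and tallies its transitions. It assumes the common BMC convention in which a jump from a state in cluster $k$ lands on a uniformly chosen state of cluster $l$ with probability $p_{k,l}$; the function names `simulate_bmc` and `count_matrix`, the rounding of cluster sizes, and the uniform initial cluster are illustrative assumptions rather than the authors' implementation. The parameters are those of scenario (a) in Figure 2.

```python
import numpy as np

def simulate_bmc(n, alpha, p, path_length, rng):
    """Simulate X_0, ..., X_{path_length} of a BMC on states 0..n-1 (illustrative sketch)."""
    alpha = np.asarray(alpha, dtype=float)
    p = np.asarray(p, dtype=float)
    K = len(alpha)
    # Assumption: cluster k gets about alpha_k * n states; the last cluster absorbs rounding.
    sizes = np.floor(alpha * n).astype(int)
    sizes[-1] = n - sizes[:-1].sum()
    labels = np.repeat(np.arange(K), sizes)
    members = [np.flatnonzero(labels == k) for k in range(K)]

    # Cluster-level chain: move between clusters according to the K x K matrix p.
    cum_p = np.cumsum(p, axis=1)
    u = rng.random(path_length)
    clusters = np.empty(path_length + 1, dtype=int)
    clusters[0] = rng.integers(K)  # assumption: uniform initial cluster
    for t in range(path_length):
        nxt = np.searchsorted(cum_p[clusters[t]], u[t], side="right")
        clusters[t + 1] = min(nxt, K - 1)  # guard against floating-point round-off

    # State-level chain: a uniformly random state inside each visited cluster.
    trajectory = np.empty(path_length + 1, dtype=int)
    for k in range(K):
        mask = clusters == k
        trajectory[mask] = rng.choice(members[k], size=mask.sum())
    return trajectory, labels

def count_matrix(trajectory, n):
    """Entry (x, y) counts the observed transitions x -> y along the trajectory."""
    N = np.zeros((n, n), dtype=int)
    np.add.at(N, (trajectory[:-1], trajectory[1:]), 1)
    return N

rng = np.random.default_rng(0)
trajectory, labels = simulate_bmc(
    n=100, alpha=[0.5, 0.5], p=[[0.92, 0.08], [0.12, 0.88]],
    path_length=40_000, rng=rng,
)
N_hat = count_matrix(trajectory, n=100)
```

Each of the 40,000 transitions contributes exactly one count, so the entries of `N_hat` sum to the path length. The trimming step mentioned in the paper's outline is not included in this sketch.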

Paper Structure (93 sections, 15 theorems, 92 equations, 13 figures, 6 tables, 7 algorithms)

Introduction
Main results
Overview of related literature
Clustering in BMCs
Clustering in nonsynthetic, sequential data
Estimating the number of communities in SBMs
Hidden Markov Models (HMMs)
Preliminaries
Block Markov Chains (BMCs)
Generating BMCs at finite n
Approximate cluster assignment
Asymptotic notation
The algorithm
Step 1: Singular value thresholding on a trimmed count matrix
Consistency result
...and 78 more sections

Key Result

Proposition 3.1

Presume Assumptions \ref{asm:assumption1} and \ref{asm:assumption2}, and that $\ell_n = \omega(n)$. If $\gamma_n = \omega(\sqrt{\ell_n/n})$ and $\gamma_n = o(\ell_n/n)$, then the output of Algorithm \ref{alg:kpre}, $\hat{K}^{\textnormal{spec}}$, equals $\textnormal{rank}(p)$ with high probability as $n \rightarrow \infty$.
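The thresholding idea behind this result can be sketched in a few lines of Python: count the singular values of the count matrix that exceed $\gamma_n$. The sketch uses $\gamma_n = (\ell_n/n)^{3/4}$, the choice mentioned in the caption of Figure 3, which lies strictly between the two rates in the proposition when $\ell_n/n \to \infty$. Several things here are assumptions for illustration only: the function name `estimate_rank_by_thresholding`, the omission of the trimming step, and above all the surrogate data, which draws independent Poisson counts around the expected count matrix instead of simulating a dependent BMC trajectory.

```python
import numpy as np

def estimate_rank_by_thresholding(count_matrix, path_length, exponent=0.75):
    """Count the singular values of the count matrix above gamma_n = (l_n / n)^exponent.

    Illustrative sketch only: no trimming is applied here.
    """
    n = count_matrix.shape[0]
    gamma = (path_length / n) ** exponent
    singular_values = np.linalg.svd(count_matrix, compute_uv=False)
    return int(np.sum(singular_values > gamma)), gamma

# Surrogate data (assumption): independent Poisson counts whose means follow
# a two-cluster block structure with the parameters of Figure 2(a).
rng = np.random.default_rng(0)
n, path_length = 100, 40_000
p = np.array([[0.92, 0.08], [0.12, 0.88]])
pi = np.array([0.6, 0.4])            # stationary distribution of p
size = n // 2                        # two equally sized clusters
labels = np.repeat([0, 1], size)

# Expected number of x -> y transitions: l_n * pi_k * p_{k,l} / (size * size).
block_means = path_length * pi[:, None] * p / size**2
mean = block_means[np.ix_(labels, labels)]
counts = rng.poisson(mean)

k_spec, gamma = estimate_rank_by_thresholding(counts, path_length)
print(k_spec, round(gamma, 2))
```

In this well-separated setting, the two singular values carrying the block structure are of order $\ell_n/n$, while the remaining ones are of order $\sqrt{\ell_n/n}$, so a threshold between the two scales separates them. For shorter paths the gap narrows, which is consistent with the finite-sample limitations noted in the summary above.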

Figures (13)

Figure 1: Consider the trajectory $X_0, \ldots, X_{\ell_n}$, depicted using thin black arrows, of some BMC. Are there $K=2$ or perhaps $K=3$ clusters, as depicted on the left or right? Here, the dark gray circles represent states, the larger light gray circles represent clusters, and the thick arrows represent the (hidden) low-dimensional transition probabilities between clusters.
Figure 2: Scatter plots of the estimated number of clusters as a function of size at three different path lengths, and in four different scenarios ranging from easy to more difficult. Each $95\%$-confidence interval was calculated using $24$ independent replications. The parameters of these BMCs are as follows: (a) $\alpha = (0.5; 0.5)$ and $p = ( 0.92, 0.08; 0.12, 0.88 )$, (b) $\alpha = (0.3; 0.3; 0.4)$ and $p = ( 0.05, 0.10, 0.85; 0.40, 0.50, 0.10; 0.05, 0.90, 0.05 )$, (c) $\alpha = (0.2; 0.3; 0.5)$ and $p = ( 0.20, 0.35, 0.45; 0.40, 0.50, 0.10; 0.05, 0.60, 0.35 )$, (d) $K = 10$, $\alpha_k = 1/K$, and a transition matrix sampled uniformly at random; see Section \ref{sec:uniform_BMCs}.
Figure 3: (a) Histograms of the relative accuracy of Algorithms \ref{alg:kpre} and \ref{alg:kpost} for BMCs as in Example \ref{ex:dot_product_model} with $K=10$, $d=5$, and $v_i\in\mathbb{R}^d$ and $\alpha$ sampled as described in Section \ref{sec:low_rank_BMCs}. Here, $n=1000$, and the path length $\ell_n=n(\ln n)^{\beta}$ is varied with $\beta= 5.0, 3.5, 2.0$ from top to bottom. Each histogram is the result of $500$ independent repetitions. (b) Histograms of the empirical singular value distribution of $\hat{N} / \gamma_n$, where $\gamma_n = (\ell_n/n)^{3/4}$, for some random BMC from Figure \ref{fig:Low_rank_example_relative_accuracy} while varying the path length as in (a). The red dots indicate the location of the five largest singular values. The solid red curve represents the theoretical prediction for the limiting distribution from \cite{vanwerde2023matrix}. The dashed vertical line indicates the location of the threshold $\gamma_n$, which remains fixed due to the rescaling.
Figure 4: Histograms of relative accuracy when $r \in \{ 5, 10, 15 \}$ and $\ell_n = n (\ln n)^{\beta}$ with $\beta \in \{ 2, 3, 4 \}$, for uniformly sampled and reversible BMCs; see Sections \ref{sec:uniform_BMCs} and \ref{sec:reversible_BMCs}. Here, $n=1000$, $K=10$, and each histogram is the result of $500$ independent replications.
Figure 5: Scatter plots of pairs $(\beta,p_0)$ for which each algorithm correctly estimates $K$ at least half of the time. From (a) to (c), each data point is obtained using $24$, $24$, and $384$ independent replications, respectively. Arrows in (c) indicate regions where Algorithm \ref{alg:kpre} outputs values greater/less than 2 at least half of the time.
...and 8 more figures

Theorems & Definitions (26)

Proposition 3.1
Example 3.2: Dot-product model \cite{10.5555/1777879.1777890, athreya2018statistical}
Theorem 3.3
Theorem 3.4: Adapted from \cite{ClusterBMC2017}
Theorem A.1: Adapted from \cite{Bounds2023}
Corollary A.2: Adapted from \cite{Bounds2023}
Lemma A.3
proof
Definition A.4: Discrepancy property
Proposition A.5
...and 16 more
