Estimating the number of clusters of a Block Markov Chain
Thomas van Vuren, Thomas Cronk, Jaron Sanders
TL;DR
This work develops a principled method to automatically estimate the number of clusters K in trajectories of Block Markov Chains by combining a trimmed, count-matrix-based spectral embedding with density-based clustering. The two-stage approach first uses singular value thresholding to obtain a low-rank embedding and a preliminary estimate K̂_spec of K, then refines this estimate via density-based clustering of the embedding, with an optional K-means completion. The authors prove asymptotic consistency: in Step 1, K̂_spec consistently recovers rank(p), and the density-based Step 2 recovers K when the information quantity I(α,p) is positive and ℓ_n grows faster than n; a misclassification bound is provided for the final clustering. Extensive numerical experiments demonstrate robustness and provide insights into embedding dimension choices, path lengths, and comparisons with alternative methods, highlighting both the method's strengths and its practical limitations in finite-sample regimes.
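For orientation, the quantities named in this summary can be written out as below. This is a sketch under assumptions, not the paper's own statement: it assumes the usual Block Markov Chain setup (states V = {1, ..., n} partitioned into clusters V_1, ..., V_K, a cluster map σ, a K×K cluster transition matrix p, and a trajectory X_0, ..., X_{ℓ_n}). The threshold γ_n and the trimmed count matrix written with subscript Γ are placeholder symbols introduced here for illustration only.

```latex
\begin{align*}
% Assumed transition structure: jump to a cluster according to p,
% then pick a state uniformly within that cluster.
P_{x,y} &= \frac{p_{\sigma(x),\sigma(y)}}{|V_{\sigma(y)}|},
  \qquad x, y \in V, \\
% Assumed meaning of alpha: limiting relative cluster sizes.
\alpha_k &= \lim_{n \to \infty} \frac{|V_k|}{n},
  \qquad k = 1, \dots, K, \\
% Count matrix of observed transitions along the trajectory.
\hat{N}_{x,y} &= \sum_{t=1}^{\ell_n} \mathbf{1}\{X_{t-1} = x,\ X_t = y\}, \\
% Step 1: number of singular values s_i of the trimmed count matrix
% above a (placeholder) threshold gamma_n.
\hat{K}_{\mathrm{spec}} &= \#\bigl\{\, i : s_i(\hat{N}_{\Gamma}) > \gamma_n \,\bigr\}, \\
% "ell_n grows faster than n".
\frac{\ell_n}{n} &\to \infty .
\end{align*}
```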
Abstract
Clustering algorithms frequently require the number of clusters to be chosen in advance, but it is usually not clear how to do this. To tackle this challenge when clustering within sequential data, we present a method for estimating the number of clusters when the data is a trajectory of a Block Markov Chain. Block Markov Chains are Markov Chains that exhibit a block structure in their transition matrix. The method considers a matrix that counts the number of transitions between different states within the trajectory, and transforms this into a spectral embedding whose dimension is set via singular value thresholding. The number of clusters is subsequently estimated via density-based clustering of this spectral embedding, an approach inspired by literature on the Stochastic Block Model. By leveraging and augmenting recent results on the spectral concentration of random matrices with Markovian dependence, we show that the method is asymptotically consistent, despite the dependencies between the count matrix's entries and even when the count matrix is sparse. We also present a numerical evaluation of our method, and compare it to alternatives.
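The pipeline described above (count matrix, trimming, thresholded spectral embedding, density-based clustering) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. All function names are hypothetical, and the threshold `sqrt(length / n) * log(n)`, the radius `eps`, `min_pts`, and the number of trimmed states are illustrative choices rather than the values analysed in the paper. The density step is a simple DBSCAN-style count of connected components of core points, swapped in as a stand-in for the paper's density-based clustering.

```python
import numpy as np

def simulate_bmc(p, sizes, length, rng):
    """Sample length+1 states: draw the next cluster from p, then a uniform state in it."""
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    cluster = rng.integers(len(sizes))
    path = np.empty(length + 1, dtype=int)
    for t in range(length + 1):
        path[t] = offsets[cluster] + rng.integers(sizes[cluster])
        cluster = rng.choice(len(sizes), p=p[cluster])
    return path

def count_matrix(path, n):
    """N[x, y] = number of observed transitions x -> y along the path."""
    N = np.zeros((n, n), dtype=int)
    np.add.at(N, (path[:-1], path[1:]), 1)
    return N

def trim(N, num_trim):
    """Zero the rows and columns of the num_trim most visited states (assumed trimming rule)."""
    N = N.astype(float)
    if num_trim > 0:
        visits = N.sum(axis=0) + N.sum(axis=1)
        heavy = np.argsort(visits)[-num_trim:]
        N[heavy, :] = 0.0
        N[:, heavy] = 0.0
    return N

def spectral_embedding(N, threshold):
    """Keep singular values above threshold; embed each state via scaled singular vectors."""
    U, s, Vt = np.linalg.svd(N)
    r = int(np.sum(s > threshold))
    X = np.hstack([U[:, :r] * s[:r], Vt[:r].T * s[:r]])
    return X, s[:r]

def count_dense_clusters(X, eps, min_pts):
    """Count connected components of core points (>= min_pts neighbours within eps)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = D <= eps
    core = np.flatnonzero(neighbors.sum(axis=1) >= min_pts)
    labels = -np.ones(len(X), dtype=int)  # -1 marks non-core points (treated as noise)
    k = 0
    for i in core:
        if labels[i] >= 0:
            continue
        labels[i] = k
        stack = [i]
        while stack:
            j = stack.pop()
            for m in core[neighbors[j, core]]:
                if labels[m] < 0:
                    labels[m] = k
                    stack.append(m)
        k += 1
    return k, labels

# Toy example: K = 3 clusters of 20 states each, "sticky" cluster transitions.
rng = np.random.default_rng(0)
p = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
sizes = [20, 20, 20]
n, length = sum(sizes), 20_000

path = simulate_bmc(p, sizes, length, rng)
N = count_matrix(path, n)
threshold = np.sqrt(length / n) * np.log(n)  # illustrative threshold, not the paper's
X, s = spectral_embedding(trim(N, num_trim=2), threshold)
r_hat = len(s)                               # Step 1: preliminary rank estimate
eps = s[-1] / np.sqrt(n)                     # illustrative neighbourhood radius
K_hat, labels = count_dense_clusters(X, eps, min_pts=5)  # Step 2
print("estimated rank:", r_hat, "estimated clusters:", K_hat)
```

In this sketch the trimmed states land at the origin of the embedding and are left unlabelled, since too few points lie near them to form a core. A completion step such as K-means could assign them afterwards.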
