Clustering risk in Non-parametric Hidden Markov and I.I.D. Models
Elisabeth Gassiat, Ibrahim Kaddouri, Zacharie Naulet
TL;DR
This work gives a theoretical treatment of clustering risk under nonparametric i.i.d. and Hidden Markov Models, clarifying when the Bayes classifier can serve as an effective clusterer and when it cannot. It introduces a central separation quantity, $\nabla$, and derives tight bounds linking the Bayes risk of clustering to the Bayes risk of classification in both the i.i.d. and HMM settings, with exact results in the two-component case. The authors also establish that the plug-in Bayes classifier achieves near-optimal clustering performance, with excess risk decaying at nonparametric rates under Hölder-smooth emission densities, and provide supplementary proofs for the main theorems. The results offer theoretical justification for practical HMM clustering procedures and highlight the role of identifiability in enabling nonparametric clustering without restrictive parametric assumptions. Together, they show how latent structure, model identifiability, and separation between population densities shape clustering performance in temporal models.
Abstract
We conduct an in-depth analysis of the Bayes risk of clustering in the context of Hidden Markov and i.i.d. models. In both settings, we identify the situations where this risk is comparable to the Bayes risk of classification and those where its minimizer, the Bayes clusterer, can be derived from the Bayes classifier. While we demonstrate that clustering based on the Bayes classifier does not always match the optimal Bayes clusterer, we show that this difference is primarily theoretical and that the Bayes classifier remains nearly optimal for clustering. A key quantity emerges, capturing the fundamental difficulty of both classification and clustering tasks. Furthermore, by leveraging the identifiability of HMMs, we establish bounds on the clustering excess risk of a plug-in Bayes classifier in the general nonparametric setting, offering theoretical justification for its widespread use in practice. Simulations further illustrate our findings.
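The relation between classification risk and clustering risk described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the model is a two-component i.i.d. Gaussian mixture with unit variance and known parameters, the clustering error is taken to be the misclassification rate minimized over relabelings of the clusters, and the function names are hypothetical.

```python
import itertools
import numpy as np


def classification_error(labels, preds):
    """Fraction of points whose predicted label differs from the true one."""
    return float(np.mean(labels != preds))


def clustering_error(labels, preds, n_classes):
    """Misclassification rate minimized over permutations of the predicted labels.

    Assumed definition: a clustering is only identified up to relabeling,
    so the error is taken under the best relabeling.
    """
    best = 1.0
    for perm in itertools.permutations(range(n_classes)):
        relabeled = np.asarray(perm)[preds]
        best = min(best, float(np.mean(labels != relabeled)))
    return best


def bayes_classifier(x, weights, means, sigma=1.0):
    """Assign each point to the component with the largest posterior probability."""
    log_post = np.log(weights)[None, :] - 0.5 * ((x[:, None] - means[None, :]) / sigma) ** 2
    return np.argmax(log_post, axis=1)


# Illustrative mixture parameters (assumed, not from the paper).
rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
n = 2000

labels = rng.choice(2, size=n, p=weights)   # hidden states
x = rng.normal(means[labels], 1.0)          # observations
preds = bayes_classifier(x, weights, means)

clf_err = classification_error(labels, preds)
clu_err = clustering_error(labels, preds, n_classes=2)
print(f"classification error: {clf_err:.3f}, clustering error: {clu_err:.3f}")
```

Under this definition the clustering error can never exceed the classification error, since the identity relabeling is one of the permutations considered. The sketch only evaluates the Bayes classifier as a clusterer; it does not compute the optimal Bayes clusterer discussed in the abstract.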
