Clustering risk in Non-parametric Hidden Markov and I.I.D. Models

Elisabeth Gassiat; Ibrahim Kaddouri; Zacharie Naulet

TL;DR

This work gives a theoretical treatment of clustering risk under nonparametric i.i.d. and Hidden Markov Models, clarifying when the Bayes classifier can serve as an effective clusterer and when it cannot. It introduces a central separation quantity, denoted $\nabla$, and derives tight bounds linking the Bayes risk of clustering to the Bayes risk of classification in both the i.i.d. and HMM settings, with exact results in the two-component case. The authors also establish that the plug-in Bayes classifier achieves near-optimal clustering performance, with excess risk decaying at nonparametric rates under Hölder-smooth emission densities, and provide supplementary proofs for the main theorems. The results offer theoretical justification for practical HMM clustering procedures and highlight the role of identifiability in enabling nonparametric clustering without restrictive parametric assumptions. Overall, the paper shows how latent structure, model identifiability, and separation between the population densities shape clustering performance in temporal models, with direct implications for nonparametric HMM-based clustering in practice.

Abstract

We conduct an in-depth analysis of the Bayes risk of clustering in the context of Hidden Markov and i.i.d. models. In both settings, we identify the situations where this risk is comparable to the Bayes risk of classification and those where its minimizer, the Bayes clusterer, can be derived from the Bayes classifier. While we demonstrate that clustering based on the Bayes classifier does not always match the optimal Bayes clusterer, we show that this difference is primarily theoretical and that the Bayes classifier remains nearly optimal for clustering. A key quantity emerges, capturing the fundamental difficulty of both classification and clustering tasks. Furthermore, by leveraging the identifiability of HMMs, we establish bounds on the clustering excess risk of a plug-in Bayes classifier in the general nonparametric setting, offering theoretical justification for its widespread use in practice. Simulations further illustrate our findings.
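To make the distinction between the two risks concrete, the following is a minimal sketch of their empirical counterparts. It assumes the common convention that a clustering error is the misclassification rate minimized over all relabelings of the predicted classes; the function names `classification_error` and `clustering_error` are hypothetical and are not taken from the paper.

```python
from itertools import permutations

import numpy as np


def classification_error(true_labels, predicted_labels):
    """Fraction of positions where the predicted label differs from the true one."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    return float(np.mean(true_labels != predicted_labels))


def clustering_error(true_labels, predicted_labels, n_classes):
    """Classification error minimized over all relabelings of the predicted classes.

    Assumption: labels are integers in {0, ..., n_classes - 1}. The brute-force
    search over permutations is only practical for a small number of classes.
    """
    predicted_labels = np.asarray(predicted_labels)
    best = 1.0
    for perm in permutations(range(n_classes)):
        # perm[j] is the new name given to predicted class j.
        relabeled = np.asarray(perm)[predicted_labels]
        best = min(best, classification_error(true_labels, relabeled))
    return best


# A prediction that recovers the partition but swaps the names of classes 0 and 1.
truth = [0, 0, 1, 1, 2]
guess = [1, 1, 0, 0, 2]
print(classification_error(truth, guess))  # 0.8
print(clustering_error(truth, guess, 3))   # 0.0
```

In this toy example the predicted labels are a poor classification but a perfect clustering, which is why the clustering error can never exceed the classification error of the same labeling.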

Paper Structure (34 sections, 36 theorems, 251 equations, 4 figures, 1 table, 1 algorithm)

Introduction
Setting and definitions
Notations
The model
The problem of clustering
Main results
I.I.D. case
HMM case
A key quantity for the Bayes risk of clustering for both I.I.D. and HMM
Reaching the Bayes risk
I.I.D. setting
HMM setting
Numerical simulations
Discussions and Perspectives
Proof of [4, Theorem 4]
...and 19 more sections

Key Result

Theorem 1

The Bayes classifier and the Bayes clusterer coincide in the i.i.d. setting with $J=2$. If $J\geq 3$ or the labels are dependent, then there exist distributions for which the Bayes classifier and the Bayes clusterer differ.
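As a reading aid, the two risks being compared can be written as follows. This is a sketch under assumed notation rather than the paper's exact definitions: $X_{1:n}$ denotes the hidden labels taking values in $\{1,\dots,J\}$, $Y_{1:n}$ the observations, and $\mathfrak{S}_J$ the set of permutations of $\{1,\dots,J\}$.

$$
R_n^{\mathrm{class}}(h) = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{h_i(Y_{1:n}) \neq X_i\}\right],
\qquad
R_n^{\mathrm{clust}}(g) = \mathbb{E}\left[\min_{\tau \in \mathfrak{S}_J} \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{g_i(Y_{1:n}) \neq \tau(X_i)\}\right].
$$

Since the identity permutation is admissible in the minimum, $R_n^{\mathrm{clust}}(g) \leq R_n^{\mathrm{class}}(g)$ for every labeling rule $g$, so the Bayes risk of clustering is at most the Bayes risk of classification. The minimum sits inside the expectation, which is why the minimizer of the clustering risk need not be the Bayes classifier.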

Figures (4)

Figure 1: Example of a matching. Nodes on the left represent the clusters induced by the partition of $\Pi_n$; those on the right are the clusters of $g(Y_{1:n})$. Edges form a matching between the two partitions.
Figure 2: Non-parametric penalized least squares density estimation using the histogram basis for Example 1 and Example 2.
Figure 3: Histograms of clusters and clustering errors for Example 1.
Figure 4: Histograms of clusters and clustering errors for Example 2.

Theorems & Definitions (54)

Theorem 1: Informal
Theorem 2: Informal
Theorem 3: Informal
Definition 1: Clusterer
Definition 2: Classifier
Remark 1
Remark 2
Theorem 4
Theorem 5
Corollary 1
...and 44 more
