Goodness-of-fit testing of the distribution of posterior classification probabilities for validating model-based clustering

Salima El Kolei, Matthieu Marbac

TL;DR

This work proposes a first-of-its-kind goodness-of-fit test for model-based clustering that targets the distribution of posterior classification probabilities rather than the full data likelihood. By formulating moment conditions on functionals of the posterior probabilities and employing an empirical likelihood ratio with a growing number of moments, the method can detect any meaningful deviation from the assumed clustering model. The approach relies only on a consistent estimator of the model parameters and their posterior classifications, uses data splitting to manage nuisance-parameter effects, and provides asymptotic level guarantees under a flexible growing-dimension framework. Applications to real and simulated data demonstrate the method’s ability to validate clustering relevance across parametric and nonparametric settings and to assess assumptions such as independence within components.

Abstract

We present the first method for assessing the relevance of a model-based clustering result in both parametric and non-parametric frameworks. The method directly aligns with the clustering objective by assessing how well the conditional probabilities of cluster memberships, as defined by the mixture model, fit the data. By focusing on these conditional probabilities, the procedure applies to any type and dimension of data and any mixture model. The testing procedure requires only a consistent estimator of the parameters and the associated conditional probabilities of classification for each observation. Its implementation is straightforward, as no additional estimator is needed. Under the null hypothesis, the method relies on the fact that any functional transformation of the posterior probabilities of classification has the same expectation under both the model being tested and the true model. This goodness-of-fit procedure is based on an empirical likelihood method with an increasing number of moment conditions to asymptotically detect any alternative. Data are split into blocks to account for the use of a parameter estimator, and the empirical log-likelihood ratio is computed for each block. By analyzing the deviation of the maximum empirical log-likelihood ratios, the exact asymptotic significance level of the goodness-of-fit procedure is obtained.
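
The ingredients named in the abstract (moment conditions on functionals of the posterior probabilities, a per-block empirical log-likelihood ratio, and the maximum over blocks) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: a univariate two-component Gaussian mixture, parameters treated as known (standing in for a consistent estimator), power functions $h_j(t) = t^j$ as the functionals, model-implied expectations approximated by Monte Carlo, and the hypothetical helper names `posterior_probs`, `el_log_ratio` and `max_block_el_statistic`. The critical value from the paper's theorem is not reproduced here.

```python
import numpy as np


def posterior_probs(x, weights, means, sds):
    """Posterior classification probabilities for a univariate Gaussian mixture."""
    x = np.asarray(x, dtype=float)[:, None]
    dens = weights * np.exp(-0.5 * ((x - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    return dens / dens.sum(axis=1, keepdims=True)


def el_log_ratio(g, n_iter=50, tol=1e-10):
    """Empirical log-likelihood ratio 2 * sum_i log(1 + lam' g_i) for H0: E[g] = 0.

    lam maximizes the concave dual objective; damped Newton steps keep
    every implied weight 1 / (n * (1 + lam' g_i)) below one.
    """
    n, r = g.shape
    lam = np.zeros(r)

    def objective(l):
        w = 1.0 + g @ l
        return -np.inf if np.any(w <= 1.0 / n) else np.sum(np.log(w))

    for _ in range(n_iter):
        w = 1.0 + g @ lam
        scaled = g / w[:, None]
        grad = scaled.sum(axis=0)
        if np.linalg.norm(grad) < tol:
            break
        hess = -scaled.T @ scaled              # negative definite Hessian
        step = np.linalg.solve(hess, -grad)    # Newton ascent direction
        current = objective(lam)
        t = 1.0
        while t > 1e-8 and objective(lam + t * step) < current:
            t /= 2.0                           # backtrack until no decrease
        if t <= 1e-8:
            break
        lam = lam + t * step
    return 2.0 * objective(lam)


def max_block_el_statistic(x, weights, means, sds, n_blocks=4, n_moments=2,
                           n_mc=200_000, seed=0):
    """Maximum over blocks of the empirical log-likelihood ratio."""
    weights, means, sds = (np.asarray(a, dtype=float) for a in (weights, means, sds))
    rng = np.random.default_rng(seed)
    powers = np.arange(1, n_moments + 1)

    # Assumption: model-implied expectations of h_j(tau) are approximated by
    # simulating from the tested mixture model.
    comp = rng.choice(len(weights), size=n_mc, p=weights)
    x_mc = rng.normal(means[comp], sds[comp])
    tau_mc = posterior_probs(x_mc, weights, means, sds)[:, 0]
    m0 = (tau_mc[:, None] ** powers).mean(axis=0)

    # Centered moment functions evaluated on the observed sample.
    tau = posterior_probs(x, weights, means, sds)[:, 0]
    g = tau[:, None] ** powers - m0
    return max(el_log_ratio(block) for block in np.array_split(g, n_blocks))
```

A call such as `max_block_el_statistic(x, [0.5, 0.5], [-1.5, 1.5], [1.0, 1.0])` returns the largest block statistic; large values point to a mismatch between the observed posterior probabilities and those implied by the tested model. The number of blocks and moments are fixed here for simplicity, whereas the paper lets the number of moment conditions grow with the sample size.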

Paper Structure

This paper contains 17 sections, 5 theorems, 183 equations, 3 figures, 4 tables.

Key Result

Theorem 1

If Assumptions ass:main hold, then under the null hypothesis stated in eq:null, the asymptotic level of the testing procedure is equal to $\alpha$, so that

Figures (3)

  • Figure 1: Quantile-quantile plot comparing the empirical distribution of the posterior classification probabilities for each component with their theoretical distributions under the fitted mixture model.
  • Figure 2: Kernel density estimations for each of the four variables of the Graft-versus-Host Disease data.
  • Figure 3: Quantile-quantile plot comparing the empirical distribution of the posterior classification probabilities for Component 1 with their theoretical distributions under the Gaussian mixture model (on the left) and under the Non-parametric mixture model (on the right).

Theorems & Definitions (11)

  • Remark 1
  • Theorem 1
  • Proof: Sketch of Proof of Theorem \ref{thm:niveau}
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Proof: Proof of Lemma \ref{lem:controlMax}
  • Proof: Proof of Lemma \ref{lem:covmatrix}
  • Proof: Proof of Lemma \ref{lem:lagrange}
  • ...and 1 more