A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at Random

Binh H. Ho, Long Nguyen Chi, TrungTin Nguyen, Binh T. Nguyen, Van Ha Hoang, Christopher Drovandi

TL;DR

This work addresses variable selection in model-based clustering when data are missing not at random (MNAR) by proposing a unified SRUW-MNARz framework. It combines a data-driven penalty matrix with explicit MNAR modeling to jointly infer the clustering-relevant variables and the missingness mechanism, and it comes with identifiability and selection-consistency guarantees. A two-stage procedure first ranks variables using a penalized Gaussian mixture model (GMM) fitted to fast-imputed data, then assigns SRUW roles via SRUW-MNARz on the incomplete data. On synthetic and transcriptomic datasets, the approach improves clustering accuracy and imputation quality while remaining computationally efficient. Suggested extensions include mixed-type data and cluster-adaptive role assignments.

Abstract

Model-based clustering integrated with variable selection is a powerful tool for uncovering latent structures within complex data. However, its effectiveness is often hindered by challenges such as identifying relevant variables that define heterogeneous subgroups and handling data that are missing not at random, a prevalent issue in fields like transcriptomics. While several notable methods have been proposed to address these problems, they typically tackle each issue in isolation, thereby limiting their flexibility and adaptability. This paper introduces a unified framework designed to address these challenges simultaneously. Our approach incorporates a data-driven penalty matrix into penalized clustering to enable more flexible variable selection, along with a mechanism that explicitly models the relationship between missingness and latent class membership. We demonstrate that, under certain regularity conditions, the proposed framework achieves both asymptotic consistency and selection consistency, even in the presence of missing data. This unified strategy significantly enhances the capability and efficiency of model-based clustering, advancing methodologies for identifying informative variables that define homogeneous subgroups in the presence of complex missing data patterns. The performance of the framework, including its computational efficiency, is evaluated through simulations and demonstrated using both synthetic and real-world transcriptomic datasets.
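
To make the idea of missingness tied to latent class membership concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes a univariate two-component Gaussian mixture with unit variance, and the function names (`simulate_mnarz`, `missing_rate_by_class`) and all parameter values are hypothetical choices made for illustration. Each entry is masked with a probability that depends only on the latent class, not on the value itself.

```python
import random


def simulate_mnarz(n, means, weights, miss_prob, seed=0):
    """Draw n univariate observations from a Gaussian mixture and mask them
    with a probability that depends only on the latent class.

    Illustrative assumption: unit-variance components, one variable.
    """
    rng = random.Random(seed)
    labels, values, observed = [], [], []
    for _ in range(n):
        # Latent class membership z drawn from the mixing weights.
        z = rng.choices(range(len(weights)), weights=weights)[0]
        y = rng.gauss(means[z], 1.0)
        # Missingness depends on z only, not on the value y itself.
        is_missing = rng.random() < miss_prob[z]
        labels.append(z)
        values.append(None if is_missing else y)
        observed.append(not is_missing)
    return labels, values, observed


def missing_rate_by_class(labels, observed, n_classes):
    """Empirical fraction of masked entries within each latent class."""
    rates = []
    for k in range(n_classes):
        flags = [obs for z, obs in zip(labels, observed) if z == k]
        rates.append(1.0 - sum(flags) / len(flags) if flags else float("nan"))
    return rates


# Hypothetical settings: class 0 loses about 10% of entries, class 1 about 60%.
labels, values, observed = simulate_mnarz(
    n=5000, means=[-2.0, 2.0], weights=[0.5, 0.5], miss_prob=[0.1, 0.6]
)
rates = missing_rate_by_class(labels, observed, 2)
print(rates)
```

Because the per-class missing rates differ in this toy setting, the missingness pattern itself carries information about cluster membership, which is one reason to model the mechanism jointly with the clustering rather than imputing first and ignoring it.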

Paper Structure

This paper contains 25 sections, 15 theorems, 320 equations, 9 figures, 9 tables, 1 algorithm.

Key Result

Theorem 1

Let $(K, m, r, l, {\mathbb{V}})$ and $(K^*, m^*, r^*, l^*, {\mathbb{V}}^*)$ denote two models under the MNARz mechanism. Let ${\bm{\Theta}}_{(K,m,r,l,{\mathbb{V}})} \subseteq {\bm{\Upsilon}}_{(K,m,r,l,{\mathbb{V}})}$ denote the parameter space such that each element ${\bm{\theta}} = ({\bm{\alpha}},

Figures (9)

  • Figure 1: Comparison of four models under MAR and MNAR mechanisms over 20 replications; higher ARI and lower WNRMSE indicate better performance.
  • Figure 2: Proportions choosing correct relevant variables and cluster components over 20 replications.
  • Figure 3: Mean expression profiles across 18 clusters. Light region indicates irrelevant P.
  • Figure 4: Boxplot of the ARI obtained over 20 replications of simulated data. The theoretical ARIs are represented by a red dashed line.
  • Figure 5: Boxplot of the NRMSE obtained over 20 replications of simulated data.
  • ...and 4 more figures

Theorems & Definitions (39)

  • Theorem 1: Informal: Identifiability of the ${\mathbb{S}}{\mathbb{R}}{\mathbb{U}}{\mathbb{W}}$ Model
  • Theorem 2: Informal: BIC consistency for the ${\mathbb{S}}{\mathbb{R}}{\mathbb{U}}{\mathbb{W}}$ model
  • Theorem 3: Informal: Selection consistency of the two-stage procedure
  • Definition 1: ${\mathbb{S}}{\mathbb{R}}$ Model for Complete Data
  • Definition 2: ${\mathbb{S}}{\mathbb{R}}{\mathbb{U}}{\mathbb{W}}$ Model for Complete Data
  • Proposition 1
  • proof: Proof of \ref{proposition_global_GMM_representation}
  • proof
  • Proposition 2
  • proof
  • ...and 29 more