Structurally Aware Robust Model Selection for Mixtures

Jiawei Li, Jonathan H. Huggins

TL;DR

A new model selection criterion is proposed that, while model-based, uses available knowledge to obtain mixture model inferences that are robust to misspecification of the observation model. A first-of-its-kind consistency result is proved under intuitive assumptions.

Abstract

Mixture models are often used to identify meaningful subpopulations (i.e., clusters) in observed data such that the subpopulations have a real-world interpretation (e.g., as cell types). However, when used for subpopulation discovery, mixture model inference is usually ill-defined a priori because the assumed observation model is only an approximation to the true data-generating process. Thus, as the number of observations increases, rather than obtaining better inferences, the opposite occurs: the data is explained by adding spurious subpopulations that compensate for the shortcomings of the observation model. Nevertheless, there are two important sources of prior knowledge that we can exploit to obtain well-defined results no matter the dataset size: known causal structure (e.g., knowing that the latent subpopulations cause the observed signal but not vice versa) and a rough sense of how wrong the observation model is (e.g., based on small amounts of expert-labeled data or some understanding of the data-generating process). We propose a new model selection criterion that, while model-based, uses this available knowledge to obtain mixture model inferences that are robust to misspecification of the observation model. We provide theoretical support for our approach by proving a first-of-its-kind consistency result under intuitive assumptions. Simulation studies and an application to flow cytometry data demonstrate that our model selection criterion consistently finds the correct number of subpopulations.

Paper Structure (32 sections, 7 theorems, 41 equations, 10 figures, 1 table, 1 algorithm)

This paper contains 32 sections, 7 theorems, 41 equations, 10 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

If Assumptions \ref{assump:metric-discr-conditions} and \ref{assump:model-conditions}($\rho$) hold, then $\text{pr}\{\hat{K}_{N}(\rho) = K_{o}\} \to 1$ as $N \to \infty$.
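
For readers who prefer the limit spelled out, the conclusion of Theorem 1 can be restated in the usual $\varepsilon$ form. This is only an equivalent rewriting of the limit, not an additional result. Reading $K_{o}$ as the true number of subpopulations and $\hat{K}_{N}(\rho)$ as the number selected from $N$ observations is an interpretation inferred from the abstract, and $N_{\varepsilon}$ is a symbol introduced here for illustration.

```latex
% Equivalent epsilon form of "pr{ \hat{K}_N(\rho) = K_o } -> 1 as N -> infinity".
% N_\varepsilon is a hypothetical threshold symbol, not notation from the paper.
\text{For every } \varepsilon > 0 \text{ there exists } N_{\varepsilon}
\text{ such that }
\text{pr}\{\hat{K}_{N}(\rho) \neq K_{o}\} < \varepsilon
\quad \text{for all } N \geq N_{\varepsilon}.
```

In words, the probability of selecting the wrong number of components can be made arbitrarily small by collecting enough data, provided the stated assumptions hold.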

Figures (10)

  • Figure 1: For the mixture of skew-normals example from \ref{sec:motivation}, each panel shows the density of $P_{o}$ (dashed lines) and the densities of the fitted Gaussian mixture model and each component distribution (solid lines) using $N = 10\,000$ observations. Results are given for three approaches: expectation--maximization with the Bayesian information criterion (first row), the coarsened posterior (second row), and our robust model selection method (third row).
  • Figure 2: Fitting a Poisson mixture model to data from a mixture of negative binomial distributions (\ref{sec:choosing-rho}). Left: Penalized loss plot for $K=1$ (blue), $K=2$ (orange), $K=3$ (green) and $K=4$ (gray). The cross mark indicates the first wide stable region and is labeled with the number of clusters our method selects. Right: Estimated model distribution (orange) compared to the observed data (blue).
  • Figure 3: Comparison between the coarsened posterior and our method when using a Gaussian mixture model to fit data generated from a mixture of skew-normal distributions. First row: Densities of the model and components selected using the coarsened posterior (solid lines) and the density of the data distribution (dashed line). The title specifies the data-generating distribution and the number of components selected. In the middle plot of the first row, the minor cluster contains two components. Second row: Densities of the model and components selected using our structurally aware robust method. Third row: Penalized loss plots, where the cross mark indicates the first wide stable region and is labeled with the number of clusters our method selects. Line colors correspond to different $K$ values. See the caption of Figure 2 for details.
  • Figure 4: Application of our method to simulated high-dimensional data. Left: The penalized loss plot for determining the number of components. See Figure 2 for a description of the line colors. Right: Selected two-dimensional projections of the ground truth.
  • Figure 5: $\rho$ versus F-measure for training datasets 1--6 (solid lines). The black dashed line indicates averaged F-measure over the training datasets.
  • ...and 5 more figures

Theorems & Definitions (9)

  • Theorem 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Lemma 1
  • proof
  • Proposition 5
  • proof