Finding mixed memberships in categorical data

Huan Qing

TL;DR

This work advances latent mixed-membership analysis for polytomous categorical data by introducing two spectral GoM algorithms that operate on a regularized Laplacian of the response matrix. The GoM-SRSC and GoM-CRSC methods recover subject memberships $\Pi$ and item parameters $\Theta$ with scalability and provable consistency under mild sparsity, and they include a fuzzy-modularity-based criterion to determine the number of latent classes $K$. A unified framework for evaluating mixed-membership quality in real data is proposed, enabling data-driven selection of $K$ and practical interpretation of latent classes. Extensive simulations and real-data experiments demonstrate favorable accuracy and efficiency, highlighting the approach’s potential for large-scale categorical analyses. The methodology offers a principled, scalable alternative to MCMC/JML with theoretical guarantees and actionable model-selection guidance for GoM in polytomous settings.

Abstract

Latent class analysis, a fundamental problem in categorical data analysis, often encounters overlapping latent classes that introduce further challenges. This paper presents a solution to this problem by focusing on finding latent mixed memberships of subjects in categorical data with polytomous responses. We employ the Grade of Membership (GoM) model, which assigns each subject a membership score in each latent class. We propose two efficient spectral algorithms for estimating these mixed memberships and the other GoM parameters. Our algorithms are based on the singular value decomposition of a regularized Laplacian matrix. We establish their convergence rates under a mild condition on data sparsity. Additionally, we introduce a metric to evaluate the quality of estimated mixed memberships for real-world categorical data and determine the optimal number of latent classes based on this metric. Finally, we demonstrate the practicality of our methods through experiments on both computer-generated and real-world categorical datasets.
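
As a rough illustration of the spectral step mentioned in the abstract, the sketch below builds a regularized Laplacian from a response matrix $R$ and takes its top-$K$ SVD. The specific form $\mathscr{L}_{\tau}=D_{\tau}^{-1/2}R$ with $D_{\tau}=D+\tau I$ and $D(i,i)=\sum_{j}R(i,j)$, the default choice of $\tau$, and the function names are assumptions made for this example; this summary does not state them.

```python
import numpy as np


def regularized_laplacian(R, tau=None):
    """Row-normalize R by regularized row sums.

    ASSUMPTION: L_tau = D_tau^{-1/2} R with D_tau = D + tau * I, where D is the
    diagonal matrix of row sums of R. This form is not given in the summary.
    """
    R = np.asarray(R, dtype=float)
    row_sums = R.sum(axis=1)
    if tau is None:
        tau = row_sums.mean()  # hypothetical default, chosen for illustration
    if tau <= 0:
        raise ValueError("tau must be positive")
    return R / np.sqrt(row_sums + tau)[:, None]


def top_k_svd(L, K):
    """Return (U, s, V) holding the K leading singular triplets of L."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return U[:, :K], s[:K], Vt[:K, :].T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy polytomous responses in {0, ..., 4} for 200 subjects and 50 items.
    R = rng.integers(0, 5, size=(200, 50))
    U, s, V = top_k_svd(regularized_laplacian(R), K=3)
    print(U.shape, s.shape, V.shape)
```

For large, sparse response matrices a truncated solver such as `scipy.sparse.linalg.svds` would avoid computing the full decomposition; the dense call is used here only to keep the sketch short.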

Paper Structure (21 sections, 6 theorems, 18 equations, 18 figures, 5 tables, 4 algorithms)

Key Result

Lemma 1

Let $\mathscr{L}_{\tau}=U\Sigma V'$ be $\mathscr{L}_{\tau}$'s top-$K$ SVD such that $\Sigma=\mathrm{diag}(\sigma_{1}(\mathscr{L}_{\tau}), \sigma_{2}(\mathscr{L}_{\tau}),\ldots,\sigma_{K}(\mathscr{L}_{\tau}))$, $U=[\eta_{1},\eta_{2},\ldots,\eta_{K}]$, and $V=[\xi_{1},\xi_{2},\ldots,\xi_{K}]$, where $

Figures (18)

  • Figure 1: Panel (a): illustration of the simplex structure of $U_{\tau}$ with $K=3$, where dots denote rows of $U_{\tau}$. Panel (b): illustration of the cone structure of $U_{*}$ with $K=3$, where dots denote rows of $U_{*}$. For both panels, red dots denote pure rows, green dots represent mixed rows, and the gray plane denotes the hyperplane formed by pure rows. For the simplex structure in panel (a), all mixed rows of $U_{\tau}$ (i.e., green dots) lie in the triangle formed by the three rows of $U_{\tau}(\mathcal{I},:)$. For the cone structure in panel (b), all mixed rows of $U_{*}$ (i.e., green dots) lie on one side of the hyperplane formed by the three rows of $U_{*}(\mathcal{I},:)$. For both panels, the settings of $\Pi$ and $\Theta$ are the same as those of Experiment 1 in Section \ref{SecSim}. Points in this figure have been projected and rotated from $\mathbb{R}^{3}$ to $\mathbb{R}^{2}$ for visualization.
  • Figure 2: Panel (a): row vectors of $U_{\tau}$ and $\hat{U}_{\tau}$ projected from $\mathbb{R}^{3}$ to $\mathbb{R}^{2}$. Panel (b): row vectors of $U_{*}$ and $\hat{U}_{*}$. For both panels, red dots (vertices of the two red triangles) represent pure nodes, green dots represent mixed nodes, and the two gray planes denote the hyperplanes formed by pure rows. For panel (a), cyan squares represent rows of $\hat{U}_{\tau}$ and blue squares (vertices of the blue triangle in panel (a)) represent the estimated pure nodes in $\hat{U}_{\tau}$ found by the SP algorithm. For panel (b), cyan squares represent rows of $\hat{U}_{*}$ and blue squares (vertices of the blue triangle in panel (b)) represent the estimated pure nodes in $\hat{U}_{*}$ found by the SVM-cone algorithm. For both panels, the settings of $\Pi$ and $\Theta$ are the same as those of Experiment 3 in Section \ref{SecSim} when $N=3200$, $J=800$, $K=3$, and $\rho=1$.
  • Figure 3: Flowchart of Algorithm \ref{alg:SRSC}.
  • Figure 4: Flowchart of Algorithm \ref{alg:CRSC}.
  • Figure 5: Pipeline for estimating $K$ for an observed response matrix $R$ by combining the fuzzy modularity and algorithm $\mathcal{M}$.
  • ...and 13 more figures
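
The simplex structure described for Figure 1(a) suggests how pure rows can be located: if every row is a convex combination of $K$ vertex rows, a vertex-hunting routine can pick the vertices out. The sketch below assumes that the "SP algorithm" in the Figure 2 caption refers to the standard successive projection algorithm; it repeatedly selects the row with the largest norm and projects that direction out. The function names and the noiseless toy data are hypothetical, and this is not presented as the paper's exact procedure.

```python
import numpy as np


def successive_projection(X, K):
    """Return indices of K rows of X that approximate the simplex vertices."""
    Y = np.array(X, dtype=float)
    idx = []
    for _ in range(K):
        sq_norms = np.einsum("ij,ij->i", Y, Y)  # squared norm of each row
        j = int(np.argmax(sq_norms))
        idx.append(j)
        u = Y[j] / np.sqrt(sq_norms[j])         # unit vector along the chosen row
        Y = Y - np.outer(Y @ u, u)              # remove that direction from all rows
    return idx


def memberships_from_vertices(X, idx):
    """Solve X = Pi @ X[idx] for Pi when the K x K matrix X[idx] is invertible."""
    return X @ np.linalg.inv(X[idx])


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    K = 3
    B = rng.normal(size=(K, K)) + 3 * np.eye(K)         # hypothetical vertex rows
    Pi = np.vstack([np.eye(K), rng.dirichlet(np.ones(K), size=50)])
    X = Pi @ B                                           # rows lie in a simplex
    idx = successive_projection(X, K)
    Pi_hat = memberships_from_vertices(X, idx)
    print(sorted(idx))
    print(np.allclose(Pi_hat, Pi[:, idx]))
```

In this noiseless setting the recovered memberships match the true ones up to a permutation of the latent classes, which is why the comparison reorders the columns of `Pi` by `idx`. With noisy singular vectors the estimated rows may need to be clipped at zero and renormalized to sum to one.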

Theorems & Definitions (15)

  • Remark 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 1
  • Remark 2
  • Definition 1
  • proof
  • proof
  • proof
  • ...and 5 more