Finding mixed memberships in categorical data
Huan Qing
TL;DR
This work advances latent mixed-membership analysis for polytomous categorical data by introducing two spectral algorithms for the Grade of Membership (GoM) model, both of which operate on a regularized Laplacian of the response matrix. The two methods, GoM-SRSC and GoM-CRSC, recover subject memberships $\Pi$ and item parameters $\Theta$, scale to large datasets, and are provably consistent under a mild sparsity condition. They include a fuzzy-modularity-based criterion for determining the number of latent classes $K$. A unified framework for evaluating mixed-membership quality in real data is also proposed, enabling data-driven selection of $K$ and practical interpretation of the latent classes. Extensive simulations and real-data experiments demonstrate favorable accuracy and efficiency, highlighting the approach’s potential for large-scale categorical analyses. The methodology offers a principled, scalable alternative to MCMC and JML, with theoretical guarantees and actionable model-selection guidance for GoM in polytomous settings.
Abstract
Latent class analysis, a fundamental problem in categorical data analysis, becomes more challenging when latent classes overlap. This paper addresses that setting by finding the latent mixed memberships of subjects in categorical data with polytomous responses. We employ the Grade of Membership (GoM) model, which assigns each subject a membership score in each latent class. We propose two efficient spectral algorithms for estimating these mixed memberships and the other GoM parameters. Both algorithms are based on the singular value decomposition of a regularized Laplacian matrix. We establish their convergence rates under a mild condition on data sparsity. Additionally, we introduce a metric for evaluating the quality of estimated mixed memberships in real-world categorical data, and we use this metric to determine the optimal number of latent classes. Finally, we demonstrate the practicality of our methods through experiments on both computer-generated and real-world categorical datasets.
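As a rough illustration of the spectral step mentioned above, the following Python sketch builds one common form of regularized Laplacian from a response matrix and takes its top-$K$ singular value decomposition. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `regularized_laplacian_svd`, the row-degree regularization $D_\tau^{-1/2} R$ with $D_\tau = D + \tau I$, the default choice of $\tau$, and the binomial data generator are all illustrative assumptions. The later steps that turn singular vectors into membership estimates are not shown.

```python
import numpy as np


def regularized_laplacian_svd(R, K, tau=None):
    """Top-K SVD of a regularized Laplacian of the response matrix R.

    Assumed form (hypothetical): L_tau = D_tau^{-1/2} R, where D_tau = D + tau * I
    and D is the diagonal matrix of row sums of R.
    """
    R = np.asarray(R, dtype=float)
    row_sums = R.sum(axis=1)
    if tau is None:
        # Assumption: use the average row sum as a heuristic regularizer.
        tau = row_sums.mean()
    d = row_sums + tau
    # Scale each row i of R by 1 / sqrt(d_i).
    L_tau = R / np.sqrt(d)[:, None]
    U, s, Vt = np.linalg.svd(L_tau, full_matrices=False)
    # Keep the leading K singular triplets.
    return U[:, :K], s[:K], Vt[:K].T


# Synthetic polytomous responses in {0, ..., M}; this generator is an assumption.
rng = np.random.default_rng(0)
N, J, K, M = 200, 30, 3, 4
Pi = rng.dirichlet(np.ones(K), size=N)     # N x K memberships, rows sum to 1
Theta = rng.uniform(0, M, size=(J, K))     # J x K item parameters in [0, M]
R = rng.binomial(M, Pi @ Theta.T / M)      # N x J observed responses

U, s, V = regularized_laplacian_svd(R, K)
```

The regularizer $\tau$ keeps the row scaling stable for subjects with very few nonzero responses, which is the usual motivation for regularizing a Laplacian on sparse data.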
