Latent class analysis for multi-layer categorical data
Huan Qing
TL;DR
This work extends latent class analysis to multi-layer categorical data with polytomous responses by introducing the multi-layer latent class model (multi-layer LCM) and three scalable spectral estimators (LCA-SoR, LCA-DSoG, LCA-SoG). The methods rely on aggregating layer-wise responses via $R_{sum}$, $S_{sum}$, and the debiased $\tilde S_{sum}$ to recover the latent-class assignments through K-means on the leading singular/eigenvectors, with $\Theta_l$ recovered from the estimated latent structure. The authors prove estimation consistency under mild sparsity conditions, show that more layers and a debiased Gram approach improve accuracy (with LCA-DSoG typically performing best), and propose a modularity-based criterion to select the number of latent classes. Experimental results corroborate the theory, demonstrating improved latent-class recovery and robust $K$-estimation in multi-layer polytomous data, with practical implications for psychology, education, and survey research.
Abstract
Traditional categorical data, often collected in psychological tests and educational assessments, are typically single-layer and gathered only once. This paper considers a more general case: multi-layer categorical data with polytomous responses. To model such data, we present a novel statistical model, the multi-layer latent class model (multi-layer LCM). This model assumes that all layers share common subjects and items. To discover subjects' latent classes and other model parameters under this model, we develop three efficient spectral methods based on the sum of response matrices, the sum of Gram matrices, and the debiased sum of Gram matrices, respectively. Within the framework of multi-layer LCM, we demonstrate the estimation consistency of these methods under mild conditions regarding data sparsity. Our theoretical findings reveal two key insights: (1) increasing the number of layers can enhance the performance of the proposed methods, highlighting the advantages of considering multiple layers in latent class analysis; (2) the algorithm based on the debiased sum of Gram matrices usually performs best. Additionally, we propose an approach that combines the averaged modularity metric with our methods to determine the number of latent classes. Extensive experiments support our theoretical results and show the effectiveness of our methods in learning latent classes and estimating the number of latent classes in multi-layer categorical data with polytomous responses.
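To make the sum-of-response-matrices idea concrete, the following is a minimal sketch, not the paper's reference implementation. It sums the layer-wise response matrices, takes the leading $K$ left singular vectors of the sum, and clusters their rows with K-means. The function names, the hand-written K-means with farthest-point initialisation, and the synthetic data generator (binomial responses with the same item parameters in every layer) are all assumptions made for illustration.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with farthest-point initialisation (illustrative helper)."""
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest already-chosen center.
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def sum_of_responses_clustering(responses, k):
    """Cluster subjects from a list of N x J response matrices (one per layer)."""
    r_sum = np.sum(np.asarray(responses, dtype=float), axis=0)  # aggregate layers
    u, _, _ = np.linalg.svd(r_sum, full_matrices=False)
    return kmeans(u[:, :k], k)  # rows of the leading k left singular vectors


# Synthetic example (assumed generator): 60 subjects, 20 items, 5 layers,
# 2 latent classes, responses in {0, 1, 2}.
rng = np.random.default_rng(1)
n_items, n_layers, n_classes, max_resp = 20, 5, 2, 2
true_labels = np.repeat([0, 1], 30)
theta = np.full((n_items, n_classes), 0.2)  # expected response per item and class
theta[:10, 0] = 1.8   # class 0 responds high on the first ten items
theta[10:, 1] = 1.8   # class 1 responds high on the last ten items
probs = theta[:, true_labels].T / max_resp  # N x J binomial success probabilities
responses = [rng.binomial(max_resp, probs) for _ in range(n_layers)]

estimated = sum_of_responses_clustering(responses, n_classes)
```

Estimated labels are only identified up to a permutation of the class indices, so any comparison with `true_labels` should account for label switching. The Gram-based variants would replace `r_sum` with an aggregated Gram matrix and use its leading eigenvectors; their exact form is not spelled out in this section, so they are not sketched here.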
