Identifiability and improper solutions in the probabilistic partial least squares regression with unique variance
Takashi Arai
TL;DR
This work tackles the identifiability and improper-solution issues in probabilistic PLS regression with unique variances by imposing a norm constraint on the loading matrix, exploiting the model's link to factor analysis. Under this constraint, the author proves that the probabilistic PLS model is identifiable and derives conditions on the latent dimensions, then demonstrates numerically that the constrained model yields latent scores that can be interpreted via biplots while keeping predictive performance competitive with classical PLS. The method is applied to HIV protease mutation data, where BIC selects the latent dimension pair $(p_u, p_v) = (3,5)$, and the model handles missing values naturally. Synthetic-data experiments further indicate that the ML estimates are consistent and asymptotically normal, which supports the statistical soundness of the approach and its potential utility for genomic data analysis, where interpreting latent structure is crucial.
Abstract
This paper addresses theoretical issues associated with probabilistic partial least squares (PLS) regression. As in the case of factor analysis, the probabilistic PLS regression with unique variance suffers from improper solutions and a lack of identifiability, both of which cause difficulties in interpreting latent variables and model parameters. Using the fact that the probabilistic PLS regression can be viewed as a special case of factor analysis, we apply to it a norm constraint on the factor loading matrix, a prescription recently proposed in the context of factor analysis to avoid improper solutions. We then prove that the probabilistic PLS regression with this norm constraint is identifiable. We apply the probabilistic PLS regression to data on amino acid mutations in Human Immunodeficiency Virus (HIV) protease to demonstrate the validity of the norm constraint and to confirm the identifiability numerically. The proposed constraint enables the visualization of latent variables via a biplot. We also investigate the sampling distribution of the maximum likelihood estimate (MLE) using synthetically generated data. We numerically observe that the MLE is consistent and asymptotically normally distributed.
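
To make the identifiability issue concrete, the following is a minimal sketch of one familiar source of non-identifiability in factor analysis: rotating the loading matrix leaves the implied covariance unchanged. It assumes a simplified model in which predictors and responses share a single latent block, so that the stacked loading matrix gives a covariance of the form $W W^\top + \Psi$ with diagonal unique variances. The dimensions and the names `W_x`, `W_y`, `psi`, and `implied_covariance` are hypothetical and are not the parameterization used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 predictors, 3 responses, 2 shared latent factors.
d_x, d_y, p = 6, 3, 2

# Assumed simplified model: x = W_x z + e_x, y = W_y z + e_y, z ~ N(0, I).
W_x = rng.normal(size=(d_x, p))
W_y = rng.normal(size=(d_y, p))
W = np.vstack([W_x, W_y])  # stacked loading matrix, as in factor analysis

# Diagonal unique variances for the noise terms e_x and e_y.
psi = rng.uniform(0.5, 1.5, size=d_x + d_y)


def implied_covariance(loadings, unique_variances):
    """Covariance of the stacked vector (x, y) under the assumed model."""
    return loadings @ loadings.T + np.diag(unique_variances)


# A random orthogonal matrix obtained from a QR decomposition.
R, _ = np.linalg.qr(rng.normal(size=(p, p)))

sigma = implied_covariance(W, psi)
sigma_rotated = implied_covariance(W @ R, psi)

# The rotated loadings differ, yet the implied Gaussian distribution is the same.
same_covariance = np.allclose(sigma, sigma_rotated)
loadings_differ = not np.allclose(W, W @ R)
print(same_covariance, loadings_differ)
```

Because the Gaussian likelihood depends on the parameters only through this covariance, `W` and `W @ R` cannot be distinguished from data in the unconstrained sketch. This is only an illustration of why some additional constraint is needed; it does not implement the norm constraint proposed in the paper.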
