Identifiability and improper solutions in the probabilistic partial least squares regression with unique variance

Takashi Arai

TL;DR

This work tackles identifiability and improper-solution issues in probabilistic PLS regression with unique variances by imposing a norm constraint on the loading matrix, linking the model to factor analysis. Under the constraint, the author proves identifiability of the probabilistic PLS model and derives conditions on the numbers of latent dimensions, then demonstrates numerically that the constrained model yields interpretable latent scores via biplots and maintains competitive predictive performance relative to classical PLS. The method is applied to HIV protease mutation data, where BIC selects the latent dimension pair $(p_u, p_v) = (3,5)$ and the model handles missing values naturally. Additionally, synthetic-data experiments indicate that the ML estimates are consistent and asymptotically normal, supporting the statistical soundness of the approach and its potential utility for genomic data analysis where interpretation of latent structure is crucial.

Abstract

This paper addresses theoretical issues associated with probabilistic partial least squares (PLS) regression. As in the case of factor analysis, the probabilistic PLS regression with unique variance suffers from improper solutions and a lack of identifiability, both of which cause difficulties in interpreting latent variables and model parameters. Using the fact that the probabilistic PLS regression can be viewed as a special case of factor analysis, we apply to its factor loading matrix a norm constraint that was recently proposed in the context of factor analysis to avoid improper solutions. Then, we prove that the probabilistic PLS regression with this norm constraint is identifiable. We apply the probabilistic PLS regression to data on amino acid mutations in Human Immunodeficiency Virus (HIV) protease to demonstrate the validity of the norm constraint and to confirm the identifiability numerically. The proposed constraint enables the visualization of latent variables via a biplot. We also investigate the sampling distribution of the maximum likelihood estimates (MLE) using synthetically generated data. We numerically observe that the MLE is consistent and asymptotically normally distributed.

Paper Structure

This paper contains 12 sections, 4 theorems, 37 equations, 5 figures, 1 table.

Key Result

Theorem 1

Let the observed variables $\mathbf{x}$ and $\mathbf{y}$ be generated by the probabilistic PLS regression model of Eq. (eq:likelihood). If the numbers of latent dimensions satisfy $p_u + p_v < p_x + p_y$, $p_v \le p_x$, and $p_u \le p_y$, and the row norms of the scaled factor loading matrix $\hat{W}$ take the same value for all features, $\mathrm{diag}(\hat{W} \hat{W}^T) = H^2 = h^2 I$,
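
The hypotheses stated in Theorem 1 can be checked mechanically for a candidate model. The following is a minimal sketch, not the paper's implementation: the function names, the tolerance, and the example dimensions ($p_x = 10$, $p_y = 4$) are hypothetical and chosen only for illustration, and the layout of $\hat{W}$ as a $(p_x + p_y) \times (p_u + p_v)$ array is an assumption.

```python
import numpy as np


def satisfies_dimension_conditions(p_u, p_v, p_x, p_y):
    """Check the three inequalities on latent dimensions stated in Theorem 1."""
    return (p_u + p_v < p_x + p_y) and (p_v <= p_x) and (p_u <= p_y)


def has_equal_row_norms(W_hat, rtol=1e-8):
    """Check diag(W_hat @ W_hat.T) = h^2 * I, i.e. all rows share one squared norm."""
    # Squared Euclidean norm of each row, without forming W_hat @ W_hat.T.
    row_sq = np.einsum("ij,ij->i", W_hat, W_hat)
    return bool(np.allclose(row_sq, row_sq[0], rtol=rtol))


# Hypothetical example: p_x and p_y are NOT taken from the paper.
p_u, p_v, p_x, p_y = 3, 5, 10, 4
rng = np.random.default_rng(0)

# Assumed layout: one row per observed feature, one column per latent dimension.
W_raw = rng.normal(size=(p_x + p_y, p_u + p_v))

# Rescale every row to a common norm h so the constraint holds by construction.
h = 0.9
W_hat = h * W_raw / np.linalg.norm(W_raw, axis=1, keepdims=True)

dims_ok = satisfies_dimension_conditions(p_u, p_v, p_x, p_y)
norms_ok = has_equal_row_norms(W_hat)
```

Here `dims_ok` and `norms_ok` report whether the dimension inequalities and the equal-row-norm condition hold for the example; the unscaled `W_raw` would generally fail the second check.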

Figures (5)

  • Figure 1: Negative log-likelihood and BIC as a function of the number of latent dimensions. The cells surrounded by orange dashed lines correspond to models that have not been shown to be identifiable.
  • Figure 2: Predicted vs. observed plot of the proposed model for three objective variables.
  • Figure 3: Boxplots of coefficient of determination on test data with randomly introduced missing values in training data (left) and in test data (right). The red triangles represent the predictive accuracies in the absence of missing values.
  • Figure 4: Biplots of HIV-1 protease mutation data. The markers in the scatter plot are colored from blue to red according to weak to strong drug resistance to the protease inhibitor (Indinavir). The areas of the markers are proportional to the sample sizes of corresponding data points. The dashed circle represents the maximum possible length of the scaled factor loading vectors. The arrow lengths of the factor loading vectors are scaled to match the points of the factor scores $\mathbf{m}^{(x,y)}$ and $\mathbf{m}^{(x)}$.
  • Figure 5: Sampling distribution of the maximum likelihood estimates for model parameters. The true parameters are indicated by the dashed lines. The sample sizes are 1000 for the lower row figures, 3000 for the middle row figures, and 9000 for the upper row figures.

Theorems & Definitions (6)

  • Theorem 1
  • Proposition 1
  • Lemma 1
  • Proof
  • Lemma 2
  • Proof