Identifiable factor analysis for mixed continuous and binary variables based on the Gaussian-Grassmann distribution

Takashi Arai

Abstract

We develop a factor analysis for mixed continuous and binary observed variables. To this end, we utilize a recently developed multivariate probability distribution for mixed-type random variables, the Gaussian-Grassmann distribution. In the proposed factor analysis, marginalization over latent variables can be performed analytically, yielding an analytical expression for the distribution of the observed variables. This analytical tractability allows model parameters to be estimated using standard gradient-based optimization techniques. We also address improper solutions associated with maximum likelihood factor analysis. We propose a prescription to avoid improper solutions by imposing a constraint that the row vectors of the factor loading matrix have the same norm for all features. We then prove that the proposed factor analysis is identifiable under the norm constraint. We demonstrate the validity of this norm constraint prescription and numerically verify the model's identifiability using both real and synthetic datasets. We also compare the proposed model with the quantification method and find that the proposed model reproduces correlations better than the quantification method.
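
The equal-row-norm constraint mentioned in the abstract can be illustrated with a minimal sketch. It assumes one simple way of satisfying the constraint, namely rescaling every row of an unconstrained matrix to a common norm $c$; the function name `rescale_rows_to_norm`, the example values, and this particular parameterization are hypothetical and are not taken from the paper.

```python
import math


def rescale_rows_to_norm(rows, c):
    """Rescale each row of `rows` so that its Euclidean norm equals `c`.

    Hypothetical helper: one possible way to enforce equal row norms,
    not necessarily the parameterization used by the authors.
    """
    scaled = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        if norm == 0.0:
            raise ValueError("a row with zero norm cannot be rescaled")
        scaled.append([c * v / norm for v in row])
    return scaled


# Made-up unconstrained loadings: 3 features, 2 latent dimensions.
raw_loadings = [[0.3, -1.2], [2.0, 0.5], [-0.7, 0.1]]
M = rescale_rows_to_norm(raw_loadings, c=0.8)

# Diagonal of M M^T, i.e. the squared norm of each row; all equal c^2.
squared_row_norms = [sum(v * v for v in row) for row in M]
```

With such a reparameterization the unconstrained matrix can be optimized freely by a gradient-based method while the rescaled matrix always satisfies the constraint; whether the paper enforces the constraint this way is not stated in the text above.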

Paper Structure

This paper contains 17 sections, 4 theorems, 55 equations, 8 figures.

Key Result

Theorem 1

Let the observed variables $\mathbf{x}$ and $\mathbf{y}$ be generated by the proposed factor analysis of Eq. (eq:fa_induced) with the following model parameters, If the dimension of the latent space satisfies $q \le p_x + p_y$, the row norms of the dimensionless factor loading matrix $M \equiv [\Psi^{-1/2} W; G]$ are equal for all features, $\mathrm{diag}(M M^T) = c^2 I$, the symmetric matrix $M^

Figures (8)

  • Figure 1: Descriptive statistics of the HIV drug resistance data: histogram of the continuous variables, means of the binary variables and correlation matrix, where the mean and correlation for binary variables are defined using dummy variables.
  • Figure 2: The BIC as a function of the number of latent dimensions for the proposed factor analysis (Left) and factor analysis with quantification (Right).
  • Figure 3: Correlation matrices reproduced by the models and biplots of HIV-1 protease mutation data. The markers in the scatter plot are colored from blue to red according to weak to strong resistance to the protease inhibitor (Indinavir). The dashed circle represents the maximum possible length of the factor loading vectors. The arrow lengths of the factor loading vectors are scaled to match the points of the factor scores $\mathbf{m}$.
  • Figure 4: Scatter plots of empirical correlations and correlations reproduced by the proposed factor analysis as a function of the number of latent dimensions. The green solid lines represent the regression lines obtained by regressing the empirical correlation using the correlation reproduced by the model. $R^2$ indicates the coefficient of determination between empirical correlations and correlations reproduced by the models. In correlations involving binary variables, the point colors change from blue to red as the variance of the binary variable increases. In correlations between binary variables, blue points represent variables for which the mean of one of the binary variables is less than 0.1 or greater than 0.9. In correlations between continuous and binary variables, blue points represent variables for which the mean of the binary variable is less than 0.1 or greater than 0.9.
  • Figure 5: Scatter plots of empirical correlations and correlations reproduced by the factor analysis with the method of quantification as a function of the number of latent dimensions. The green solid lines represent the regression lines obtained by regressing the empirical correlation using the correlation reproduced by the model. $R^2$ indicates the coefficient of determination between empirical correlations and correlations reproduced by the models. In correlations involving binary variables, the point colors change from blue to red as the variance of the binary variable increases. In correlations between binary variables, blue points represent variables for which the mean of one of the binary variables is less than 0.1 or greater than 0.9. In correlations between continuous and binary variables, blue points represent variables for which the mean of the binary variable is less than 0.1 or greater than 0.9.
  • ...and 3 more figures

Theorems & Definitions (6)

  • Theorem 1
  • Proposition 1
  • Lemma 1
  • Proof
  • Lemma 2
  • Proof