Linked factor analysis

Giuseppe Vinci

Abstract

Factor models are widely applied to the analysis of multivariate data across disparate fields of research. However, modern scientific data are often incomplete, and estimating a factor model from partially observed data can be very challenging. In this work, we show that if the data are structurally incomplete, the factor model likelihood function can be decomposed into a product of likelihood functions for multiple factor models relative to different observed data subsets. If these factor models are linked together by common parameters, we can obtain complete maximum likelihood estimates of the full factor model parameters. We call this modeling framework Linked Factor Analysis (LINFA). LINFA can be used for covariance matrix completion, dependence estimation, dimension reduction, and data completion. We compute the maximum likelihood estimator through an efficient Expectation-Maximization algorithm, accelerated by a novel Group Vertex Tessellation algorithm. We establish the conditions for the consistency and asymptotic normality of the estimator. We design confidence regions, hypothesis tests, bootstrap algorithms, and methods for selecting the number of factors. Finally, we illustrate the application of LINFA in an extensive simulation study and in the analysis of neuroscience data.
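The likelihood decomposition described in the abstract can be illustrated with a minimal sketch. It assumes zero-mean Gaussian data and a shared loading matrix and uniqueness vector, so that each observed subset $V_k$ has marginal covariance $\Lambda_{V_k}\Lambda_{V_k}^{\mathrm T}+\Psi_{V_k}$. The function name `linfa_loglik`, the `groups` layout, and the toy dimensions are hypothetical choices made for illustration; this is not the paper's EM algorithm, only the objective that such an algorithm would plausibly maximize.

```python
import numpy as np

def linfa_loglik(Lambda, psi, groups):
    """Observed-data Gaussian log-likelihood summed over variable subsets.

    Lambda : (d, q) loading matrix shared by all groups (assumption)
    psi    : (d,) positive diagonal of Psi
    groups : list of (V, X) pairs; V is an index array of observed variables,
             X is an (n_k, len(V)) array of zero-mean observations on V
    """
    total = 0.0
    for V, X in groups:
        L = Lambda[V, :]
        S = L @ L.T + np.diag(psi[V])      # marginal covariance of subset V
        _, logdet = np.linalg.slogdet(S)   # S is positive definite, so sign is +1
        quad = np.sum(X * np.linalg.solve(S, X.T).T)  # sum_i x_i^T S^{-1} x_i
        n, p = X.shape
        total += -0.5 * (n * p * np.log(2 * np.pi) + n * logdet + quad)
    return total

# Toy example: d = 6 variables, q = 1 factor, two overlapping observed subsets.
rng = np.random.default_rng(0)
d, q = 6, 1
Lambda = rng.normal(size=(d, q))
psi = np.full(d, 0.5)
V1, V2 = np.array([0, 1, 2, 3]), np.array([2, 3, 4, 5])

def sample(V, n):
    # Draw n observations of the variables in V from the factor model.
    Z = rng.normal(size=(n, q))
    E = rng.normal(size=(n, len(V))) * np.sqrt(psi[V])
    return Z @ Lambda[V].T + E

groups = [(V1, sample(V1, 50)), (V2, sample(V2, 50))]
print(linfa_loglik(Lambda, psi, groups))
```

Because the two subsets share variables 2 and 3, both terms depend on the same rows of `Lambda` and entries of `psi`, which is the sense in which the sub-models are linked by common parameters.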

Paper Structure (31 sections, 15 theorems, 161 equations, 7 figures, 4 algorithms)

Key Result

Lemma 3.1

Let $\Sigma=\Lambda\Lambda^{\mathrm T}+\Psi$ and $\tilde{\Sigma}=\tilde{\Lambda}\tilde{\Lambda}^{\mathrm T}+\tilde{\Psi}$ be two covariance matrices, where $\Lambda,\tilde{\Lambda}\in\mathbb{R}^{d\times q}$, $\Psi,\tilde{\Psi}\in\mathcal{D}_{++}^{d\times d}$, and $1\le q<(d-1)/2$. Then, $\Sigma_{V_i

Figures (7)

  • Figure 1: Structural missingness in factor analysis: $9$-dimensional random vectors ($X$'s) depend on $3$ latent factors ($Z$'s). Only the variables denoted by blue circles are observed.
  • Figure 2: Linkage of vertex sets. (A) Linkage graphs $G^{(m)}$ of the sets $V_1=\{1,2,3,4\}$, $V_2=\{3,4,5,6\}$, $V_3=\{5,6,7,8\}$, $V_4=\{7,8,9,10\}$, and $V_5=\{9,10,11,12\}$, for $m=1,2,3$. The sets are $2$-linked but not $3$-linked. (B) Linkage graphs $G^{(m)}$ of the sets $V_1=\{1,2,3,4,5,6\}$, $V_2=\{1,7\}$, $V_3=\{2,8\}$, $V_4=\{3,9\}$, $V_5=\{4,5,10\}$, and $V_6=\{6,11\}$, for $m=1,2,3$. The sets are $1$-linked but not $2$-linked.
  • Figure 3: Group Vertex Tessellation. (A) Vertex tessellation in a sequential observation pattern. The node subsets $W_1,\ldots,W_J$ produced by Algorithm \ref{algo:gvt} are denoted by different colors. (B) Vertex tessellation in a nonsequential observation pattern.
  • Figure 4: Uncertainty quantification (settings: $d=100$, $q=3$, $\eta=0.2$, $K=3$, and $n=5000$). (A) Log SD of the LINFA MLEs $\hat{\Lambda}_{ij}$, $\hat{\Psi}_{ii}$, and $\hat{\Sigma}_{ij}$ approximated via Monte Carlo (500 repeats) and plotted against their asymptotic approximations (Equations \ref{eq:asnorm}, \ref{eq:acovsigma}). (B) Log SD of the LINFA MLE $\hat{\Sigma}_{ij}$ estimated via MLE, nonparametric bootstrap, and parametric bootstrap, plotted against the asymptotic approximations. (C) Histogram of the likelihood ratio statistic $\lambda_n$ (Equation \ref{eq:confreg}; 5000 repeats). The histogram closely approximates the theoretical p.d.f. of $\chi^2_\kappa$ (continuous curve), with degrees of freedom $\kappa=397$ as per Theorem \ref{theo:confreg}. (D) Estimated coverage probability of the confidence region for $(\Lambda,\Psi)$ (Equation \ref{eq:confreg}) for three levels of nominal coverage $90\%,95\%,99\%$ and $0.1\le\eta\le 0.4$.
  • Figure 5: (A) Average ($\pm$ 2SD) selected number of factors $q_{\rm CV}$, $q_{\rm AIC}$, and $q_{\rm BIC}$ versus ground truth $q_{\rm true}$ (settings: $K=4$; $d=100$). (B) Computational time of the LINFA Algorithm \ref{algo:mleEM} (settings: $q=2$, $\eta=0.4$) using the GVT Algorithm \ref{algo:gvt}, in sec/iteration (top) and as a proportion of the time required when optimizing coordinate-wise (bottom).
  • ...and 2 more figures
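The degrees of freedom $\kappa=397$ quoted in the Figure 4 caption agree with the standard count of free parameters in a factor model with $d=100$ and $q=3$: $dq$ loadings plus $d$ uniquenesses, minus $q(q-1)/2$ for the rotational indeterminacy of $\Lambda$. The sketch below checks this arithmetic; the function name is hypothetical, and whether Theorem 3.6 defines $\kappa$ in exactly this way is an assumption, since the theorem statement is not reproduced here.

```python
def factor_model_free_params(d: int, q: int) -> int:
    """Standard free-parameter count for a d-variable, q-factor model.

    d*q loadings + d uniquenesses - q*(q-1)/2 rotation constraints.
    (Assumed, not confirmed, to be the kappa of the confidence-region theorem.)
    """
    return d * q + d - q * (q - 1) // 2

# Settings from the Figure 4 caption: d = 100, q = 3.
print(factor_model_free_params(100, 3))  # 300 + 100 - 3 = 397
```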

Theorems & Definitions (31)

  • Definition 3.1: $m$-linkage
  • Lemma 3.1: Linkage condition
  • Theorem 3.1: Uniqueness of the LINFA MLE
  • Theorem 3.2: Optimality of the GVT Algorithm \ref{algo:gvt}
  • Theorem 3.3: Convergence of the LINFA MLE EM Algorithm \ref{algo:mleEM}
  • Theorem 3.4: Consistency
  • Theorem 3.5: Asymptotic Normality
  • Corollary 3.1
  • Theorem 3.6: Confidence Region
  • Corollary 3.2: Hypothesis Test
  • ...and 21 more
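Definition 3.1 ($m$-linkage) is not reproduced on this page, but the Figure 2 caption is consistent with the following reading, which is an assumption: the linkage graph $G^{(m)}$ has one node per set $V_k$ and an edge between two sets whenever they share at least $m$ elements, and the sets are $m$-linked when $G^{(m)}$ is connected. The sketch below implements that assumed reading (the function name `is_m_linked` is hypothetical) and checks it against the two examples in the caption.

```python
from itertools import combinations

def is_m_linked(sets, m):
    """Return True if the assumed linkage graph G^(m) is connected.

    Nodes are the given sets; two nodes are adjacent when their
    intersection has at least m elements (assumed definition).
    """
    sets = [set(s) for s in sets]
    k = len(sets)
    if k == 0:
        return True
    adj = {i: set() for i in range(k)}
    for i, j in combinations(range(k), 2):
        if len(sets[i] & sets[j]) >= m:
            adj[i].add(j)
            adj[j].add(i)
    # Depth-first search from node 0 to test connectivity.
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u] - seen:
            seen.add(v)
            stack.append(v)
    return len(seen) == k

# Sets from Figure 2, panels (A) and (B).
panel_a = [{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}, {7, 8, 9, 10}, {9, 10, 11, 12}]
panel_b = [{1, 2, 3, 4, 5, 6}, {1, 7}, {2, 8}, {3, 9}, {4, 5, 10}, {6, 11}]

print([is_m_linked(panel_a, m) for m in (1, 2, 3)])
print([is_m_linked(panel_b, m) for m in (1, 2, 3)])
```

Under this reading, panel (A) comes out 2-linked but not 3-linked and panel (B) 1-linked but not 2-linked, matching the caption; the paper's formal definition may still differ in details.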