Linked factor analysis

Giuseppe Vinci

Abstract

Factor models are widely applied to the analysis of multivariate data across disparate fields of research. However, modern scientific data are often incomplete, and estimating a factor model from partially observed data can be very challenging. In this work, we show that if the data are structurally incomplete, the factor model likelihood function can be decomposed into a product of likelihood functions for multiple factor models relative to different observed data subsets. If these factor models are linked together by common parameters, we can obtain complete maximum likelihood estimates of the full factor model parameters. We call this modeling framework Linked Factor Analysis (LINFA). LINFA can be used for covariance matrix completion, dependence estimation, dimension reduction, and data completion. We compute the maximum likelihood estimator through an efficient Expectation-Maximization algorithm, accelerated by a novel Group Vertex Tessellation algorithm. We establish the conditions for the consistency and asymptotic normality of the estimator. We design confidence regions, hypothesis tests, bootstrap algorithms, and methods for selecting the number of factors. Finally, we illustrate the application of LINFA in an extensive simulation study and in the analysis of neuroscience data.
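The decomposition described above can be sketched under assumptions that are not stated in this summary: zero-mean Gaussian data with covariance $\Sigma=\Lambda\Lambda^{\mathrm T}+\Psi$, observation patterns $V_1,\ldots,V_K\subseteq\{1,\ldots,d\}$ fixed by design, and $I_k$ denoting the samples observed exactly on $V_k$. The symbols $I_k$, $\Lambda_{V_k}$, and $\Psi_{V_kV_k}$ are illustrative notation, not necessarily the paper's. Because any sub-vector of a Gaussian factor model is again a factor model with the corresponding rows of $\Lambda$ and diagonal entries of $\Psi$, the observed-data likelihood would factor as

$$
L(\Lambda,\Psi)\;=\;\prod_{k=1}^{K}\ \prod_{i\in I_k}\mathcal{N}\!\left(x_{i,V_k};\ 0,\ \Lambda_{V_k}\Lambda_{V_k}^{\mathrm T}+\Psi_{V_kV_k}\right),
$$

where $\Lambda_{V_k}$ collects the rows of $\Lambda$ indexed by $V_k$ and $\Psi_{V_kV_k}$ is the matching diagonal block of $\Psi$. Each inner product is the likelihood of a $|V_k|$-dimensional factor model, and the sub-models share parameters wherever the sets $V_k$ overlap.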

Paper Structure (31 sections, 15 theorems, 161 equations, 7 figures, 4 algorithms)

Introduction
The Linked Factor Analysis Model
Framework
Applications
Maximum Likelihood Estimation
Linkage condition
Expectation-Maximization Algorithm
Complete log-likelihood function
E-step
M-step accelerated by Group Vertex Tessellation
Full algorithm
Statistical properties of the LINFA MLE
Assumptions
Consistency and asymptotic normality
Confidence regions and hypothesis testing
...and 16 more sections

Key Result

Lemma 3.1

Let $\Sigma=\Lambda\Lambda^{\mathrm T}+\Psi$ and $\tilde{\Sigma}=\tilde{\Lambda}\tilde{\Lambda}^{\mathrm T}+\tilde{\Psi}$ be two covariance matrices, where $\Lambda,\tilde{\Lambda}\in\mathbb{R}^{d\times q}$, $\Psi,\tilde{\Psi}\in\mathcal{D}_{++}^{d\times d}$, and $1\le q<(d-1)/2$. Then, $\Sigma_{V_i

Figures (7)

Figure 1: Structural missingness in factor analysis: $9$-dimensional random vectors ($X$'s) depend on $3$ latent factors ($Z$'s). Only the variables denoted by blue circles are observed.
Figure 2: Linkage of vertex sets. (A) Linkage graphs $G^{(m)}$ of the sets $V_1=\{1,2,3,4\}$, $V_2=\{3,4,5,6\}$, $V_3=\{5,6,7,8\}$, $V_4=\{7,8,9,10\}$, and $V_5=\{9,10,11,12\}$, for $m=1,2,3$. The sets are $2$-linked but not $3$-linked. (B) Linkage graphs $G^{(m)}$ of the sets $V_1=\{1,2,3,4,5,6\}$, $V_2=\{1,7\}$, $V_3=\{2,8\}$, $V_4=\{3,9\}$, $V_5=\{4,5,10\}$, and $V_6=\{6,11\}$, for $m=1,2,3$. The sets are $1$-linked but not $2$-linked.
Figure 3: Group Vertex Tessellation. (A) Vertex tessellation in a sequential observation pattern. The node subsets $W_1,\ldots,W_J$ produced by Algorithm \ref{algo:gvt} are denoted by different colors. (B) Vertex tessellation in a nonsequential observation pattern.
Figure 4: Uncertainty quantification (settings: $d=100$, $q=3$, $\eta=0.2$, $K=3$, and $n=5000$). (A) Log SD of the LINFA MLEs $\hat{\Lambda}_{ij}$, $\hat{\Psi}_{ii}$, and $\hat{\Sigma}_{ij}$ approximated via Monte Carlo (500 repeats) and plotted against their asymptotic approximations (Equations \ref{eq:asnorm}, \ref{eq:acovsigma}). (B) Log SD of the LINFA MLE $\hat{\Sigma}_{ij}$ estimated via MLE, nonparametric bootstrap, and parametric bootstrap, plotted against the asymptotic approximations. (C) Histogram of the likelihood ratio statistic $\lambda_n$ (Equation \ref{eq:confreg}; 5000 repeats). The histogram closely approximates the theoretical p.d.f. of $\chi^2_\kappa$ (continuous curve), with degrees of freedom $\kappa=397$ as per Theorem \ref{theo:confreg}. (D) Estimated coverage probability of the confidence region for $(\Lambda,\Psi)$ (Equation \ref{eq:confreg}) for three levels of nominal coverage ($90\%$, $95\%$, $99\%$) and $0.1\le\eta\le 0.4$.
Figure 5: (A) Average ($\pm$ 2SD) selected number of factors $q_{\rm CV}$, $q_{\rm AIC}$, and $q_{\rm BIC}$ versus ground truth $q_{\rm true}$ (settings: $K=4$; $d=100$). (B) Computational time of the LINFA Algorithm \ref{algo:mleEM} (settings: $q=2$, $\eta=0.4$) using the GVT Algorithm \ref{algo:gvt} in sec/iteration (top) and as proportion of the time required when optimizing coordinate-wise (bottom).
...and 2 more figures
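
The two examples in the Figure 2 caption can be checked with a minimal sketch. It assumes that the linkage graph $G^{(m)}$ has one node per set, joins $V_i$ and $V_j$ whenever they share at least $m$ elements, and that the sets are $m$-linked when this graph is connected. Definition 3.1 is not reproduced on this page, so that reading is an assumption inferred from the caption, and the function names `linkage_graph` and `is_m_linked` are hypothetical.

```python
from itertools import combinations


def linkage_graph(sets, m):
    """Adjacency map of the assumed graph G^(m): i ~ j when |V_i & V_j| >= m."""
    adj = {i: set() for i in range(len(sets))}
    for i, j in combinations(range(len(sets)), 2):
        if len(sets[i] & sets[j]) >= m:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def is_m_linked(sets, m):
    """True when the assumed linkage graph G^(m) is connected."""
    if not sets:
        return True
    adj = linkage_graph(sets, m)
    seen, stack = {0}, [0]
    while stack:  # depth-first search from the first set
        for nxt in adj[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(sets)


# Vertex sets copied from the Figure 2 caption.
panel_a = [{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}, {7, 8, 9, 10}, {9, 10, 11, 12}]
panel_b = [{1, 2, 3, 4, 5, 6}, {1, 7}, {2, 8}, {3, 9}, {4, 5, 10}, {6, 11}]

print([is_m_linked(panel_a, m) for m in (1, 2, 3)])  # [True, True, False]
print([is_m_linked(panel_b, m) for m in (1, 2, 3)])  # [True, False, False]
```

Under this reading, panel (A) is $2$-linked but not $3$-linked and panel (B) is $1$-linked but not $2$-linked, which agrees with the caption.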

Theorems & Definitions (31)

Definition 3.1: $m$-linkage
Lemma 3.1: Linkage condition
Theorem 3.1: Uniqueness of the LINFA MLE
Theorem 3.2: Optimality of the GVT Algorithm \ref{algo:gvt}
Theorem 3.3: Convergence of the LINFA MLE EM Algorithm \ref{algo:mleEM}
Theorem 3.4: Consistency
Theorem 3.5: Asymptotic Normality
Corollary 3.1
Theorem 3.6: Confidence Region
Corollary 3.2: Hypothesis Test
...and 21 more
