A Sparse Linear Model for Positive Definite Estimation of Covariance Matrices

Rakheon Kim, Irina Gaynanova

Abstract

Sparse covariance matrices play a crucial role in fields such as genetics and neuroscience by encoding the interdependencies between variables. Despite substantial work on sparse covariance estimation, existing methods face several challenges: correlation among the elements of the sample covariance matrix, positive definiteness of the estimator, and unbiased estimation of the diagonal elements. To address these challenges, we formulate a linear covariance model for estimating sparse covariance matrices and propose a penalized regression. This method is general enough to encompass existing sparse covariance estimators, and it can additionally account for correlation among the elements of the sample covariance matrix while avoiding unnecessary bias in the diagonal elements and preserving positive definiteness. We develop a consensus ADMM algorithm for estimation and derive the $\ell_2$ convergence rate of the proposed estimator. We apply our estimator to simulated data and to real data from neuroscience and genetics to demonstrate the efficacy of the proposed method.

Paper Structure

This paper contains 26 sections, 1 theorem, 92 equations, 5 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

Suppose Assumption ass:eigen_bound holds. Let $\mathcal{S}$ be an index set $\{m:({\boldsymbol{\sigma}^\ast}_{[o]})_m \neq 0\}$ and let $s=|\mathcal{S}|$ denote the cardinality of $\mathcal{S}$. Let $\widetilde{\boldsymbol{\sigma}}$ be the minimizer of Define ${\mathbf V}_o^{-\frac{1}{2}} \in \mathbb{R}^{p(p+1)/2 \times p(p-1)/2}$ as a sub-matrix of ${\mathbf V}^{-\frac{1}{2}}$ contai

Figures (5)

  • Figure 1: Estimators of the diagonal elements of $\boldsymbol{\Sigma}^\ast$ for one dataset with $p=20$ and sample size $n=100$ or $n=1000$, simulated from $\mathbf{N}({\mathbf 0}, \boldsymbol{\Sigma}^\ast)$. The covariance model for $\boldsymbol{\Sigma}^\ast$ is the first-order moving average with diagonal elements equal to one and first off-diagonal elements equal to 0.5. For values of the penalty parameter $\lambda$ from 0 to 0.5, $\boldsymbol{\Sigma}^\ast$ has been estimated by the $\ell_1$-penalized likelihood \eqref{eq:log-lik}. For each value of $\lambda$, the estimators of the 20 diagonal elements are shown as gray dots and their average is shown as a black dot. The true parameter is equal to one, and the bias in the diagonal estimators increases as $\lambda$ increases.
  • Figure 2: ROC curves of Soft (black dotted), SpCov (green dotted), ProxCov (blue dotted), the proposed SpLCM (red dotted) and the oracle SpLCM(O) (red solid) for 5 simulated datasets. The x-axis shows sensitivity and the y-axis shows specificity.
  • Figure 3: Heatmap of the covariance matrix by Soft, SpCov, ProxCov and SpLCM. Positive values are shown in red and negative values are shown in blue.
  • Figure 4: Cluster dendrograms from hierarchical clustering with the correlation matrix estimated by Sample (top), Soft (middle) and SpLCM (bottom). In the bottom panel (SpLCM), the genes are clustered into two major branches, denoted A and B, with 15 clusters and 3 clusters, respectively.
  • Figure 5: Clusters of genes identified by SpLCM shown in circles and the precision matrix graph by gu2015local shown with dashed lines. The genes under the branches A and B in Figure 4 are contained in boxes A and B, respectively.
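
The simulation setup described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the helper names `ma1_covariance` and `soft_threshold_offdiag` are hypothetical, the random seed and the value $\lambda = 0.2$ are arbitrary, and the thresholding step assumes that "Soft" in Figure 2 denotes entrywise soft-thresholding of the sample covariance. It does not implement SpLCM or the $\ell_1$-penalized likelihood. The sketch leaves the diagonal unpenalized and then inspects the smallest eigenvalue, since entrywise thresholding by itself does not guarantee positive definiteness.

```python
import numpy as np


def ma1_covariance(p, rho=0.5):
    """First-order moving-average covariance: ones on the diagonal,
    rho on the first off-diagonals, zeros elsewhere."""
    sigma = np.eye(p)
    idx = np.arange(p - 1)
    sigma[idx, idx + 1] = rho
    sigma[idx + 1, idx] = rho
    return sigma


def soft_threshold_offdiag(s, lam):
    """Entrywise soft-thresholding of the off-diagonal elements of s.
    The diagonal is copied back unchanged so that it is not shrunk."""
    out = np.sign(s) * np.maximum(np.abs(s) - lam, 0.0)
    np.fill_diagonal(out, np.diag(s))
    return out


# Setup from the Figure 1 caption: p = 20, n = 100, data from N(0, Sigma*).
rng = np.random.default_rng(0)  # arbitrary seed (assumption)
p, n = 20, 100
sigma_true = ma1_covariance(p, rho=0.5)
x = rng.multivariate_normal(np.zeros(p), sigma_true, size=n)

# Sample covariance and its soft-thresholded version (lambda is illustrative).
s = np.cov(x, rowvar=False)
sigma_soft = soft_threshold_offdiag(s, lam=0.2)

# Thresholding alone does not enforce positive definiteness, so check it.
min_eig = np.linalg.eigvalsh(sigma_soft).min()
print("smallest eigenvalue after thresholding:", min_eig)
print("positive definite:", bool(min_eig > 0))
print("nonzero off-diagonal entries:", int(np.count_nonzero(sigma_soft) - p))
```

Re-running the sketch with different values of `lam` or `n` gives a rough feel for how often a thresholded estimate loses positive definiteness, which is one of the challenges the abstract names.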

Theorems & Definitions (1)

  • Proposition 1