The Exact Risks of Reference Panel-based Regularized Estimators

Buxin Su, Qiang Sun, Xiaochen Yang, Bingxin Zhao

TL;DR

Even when the reference panel is as large as the training data, reference panel-based estimators tend to be less accurate than traditional regularized estimators, and this performance gap widens as the amount of training data increases.

Abstract

Reference panel-based estimators have become widely used in genetic prediction of complex traits due to their ability to address data privacy concerns and reduce computational and communication costs. These estimators estimate the covariance matrix of predictors using an external reference panel, instead of relying solely on the original training data. In this paper, we investigate the performance of reference panel-based $L_1$ and $L_2$ regularized estimators within a unified framework based on approximate message passing (AMP). We uncover several key factors that influence the accuracy of reference panel-based estimators, including the sample sizes of the training data and reference panels, the signal-to-noise ratio, the underlying sparsity of the signal, and the covariance matrix among predictors. Our findings reveal that, even when the sample size of the reference panel matches that of the training data, reference panel-based estimators tend to exhibit lower accuracy compared to traditional regularized estimators. Furthermore, we observe that this performance gap widens as the amount of training data increases, highlighting the importance of constructing large-scale reference panels to mitigate this issue. To support our theoretical analysis, we develop a novel non-separable matrix AMP framework capable of handling the complexities introduced by a general covariance matrix and the additional randomness associated with a reference panel. We validate our theoretical results through extensive simulation studies and real data analyses using the UK Biobank database.
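
The contrast between the two kinds of estimator can be made concrete with a small simulation. The sketch below is illustrative only and rests on assumptions not stated in the abstract: it takes the reference panel-based ridge estimator to be $(\bm{W}^\top\bm{W}/n_w + \lambda \bm{I})^{-1}\bm{X}^\top\bm{y}/n_x$, uses $\bm{\Sigma} = \bm{I}_p$, and measures out-of-sample $R^2$ as the squared correlation between predicted and observed outcomes on fresh data. The function names (`ridge`, `ref_ridge`, `r2`, `demo`) are hypothetical.

```python
import numpy as np

def ridge(X, y, lam):
    """Traditional ridge: the covariance is estimated from the training data X."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def ref_ridge(X, y, W, lam):
    """Reference panel-based ridge (assumed form): the covariance comes from
    the external panel W; the training data enter only through X^T y / n_x."""
    n_x, p = X.shape
    n_w = W.shape[0]
    return np.linalg.solve(W.T @ W / n_w + lam * np.eye(p), X.T @ y / n_x)

def r2(beta_hat, S, y_s):
    """Out-of-sample R^2, taken here as the squared correlation on test data."""
    return float(np.corrcoef(S @ beta_hat, y_s)[0, 1] ** 2)

def demo(n_x=400, n_w=400, n_s=2000, p=200, lam=0.5, seed=0):
    """Compare both estimators on one simulated data set with Sigma = I."""
    rng = np.random.default_rng(seed)
    beta0 = rng.standard_normal(p) / np.sqrt(p)   # dense signal, for simplicity
    X = rng.standard_normal((n_x, p))             # training predictors
    W = rng.standard_normal((n_w, p))             # independent reference panel
    S = rng.standard_normal((n_s, p))             # test predictors
    y = X @ beta0 + rng.standard_normal(n_x)
    y_s = S @ beta0 + rng.standard_normal(n_s)
    return r2(ridge(X, y, lam), S, y_s), r2(ref_ridge(X, y, W, lam), S, y_s)
```

Calling `demo()` returns the pair (traditional, reference panel-based) for a single draw and a single $\lambda$; a fair comparison in the spirit of the paper would tune $\lambda$ separately for each estimator. When `W` is set to `X`, the two estimators coincide, which is a quick sanity check on the assumed form.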

Paper Structure (84 sections, 51 theorems, 454 equations, 12 figures, 1 table)

Key Result

Theorem 3.1

Let $\{\bm{\beta}_0, \bm{\epsilon}_x, \bm{\epsilon}_s, \bm{\Sigma}, \bm{X}, \mathbf{S}, \bm{W}\}$ be a converging sequence of instances with $\mathbb{P}\left(\bm{\beta}_0(p) \neq 0\right) > 0$. Each row of $\mathbf{X}$, $\mathbf{S}$, and $\mathbf{W}$ is i.i.d. Gaussian with mean ... and the out-of-sample $R^2$ of $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ is ... Here $z \sim N$ ...

Figures (12)

  • Figure 1: Comparing the theoretical out-of-sample $R^2$ of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ as sparsity and heritability vary. The out-of-sample $R^2$ values of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ ('Lasso') and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ ('Ref Lasso') are calculated according to Proposition \ref{prop: i.i.d. non-ref lasso mse + R2} and Theorem \ref{thm: i.i.d. lasso mse + R2}, respectively. Here we set $\bm{\Sigma} = \bm{I}_{p}$, $p~=~$461,488, $n_x~=~$50,000, 100,000, 200,000, and $n_w~=~$50,000, 100,000, 200,000. Entries of $\bm{\beta}_0$ are i.i.d. random variables following the Bernoulli-Gaussian distribution. For each level of sparsity or heritability, we compute $A^2_{\textnormal{LW}}(\lambda)$ and $A^2_{\textnormal{L}}(\lambda)$ for various values of $\lambda$ and present the results with the respective best-performing $\lambda$ in the figure. Left: Heritability $h_x^2=h_s^2=0.6$, and sparsity $m/p$ varies from 0.001 to 0.49. Right: Sparsity $m/p=0.05$, and heritability $h_x^2=h_s^2$ varies from 0.01 to 0.99.
  • Figure 2: Comparing the theoretical out-of-sample $R^2$ of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ as $n_w$ and $n_x$ vary. The out-of-sample $R^2$ values of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ ('Lasso') and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ ('Ref Lasso') are calculated according to Proposition \ref{prop: i.i.d. non-ref lasso mse + R2} and Theorem \ref{thm: i.i.d. lasso mse + R2}, respectively. Here we set $\bm{\Sigma} = \bm{I}_{p}$, $p~=~$461,488, and heritability $h_x^2=h_s^2=0.3$. Entries of $\bm{\beta}_0$ are i.i.d. random variables following the Bernoulli-Gaussian distribution. Left: $n_x~=~$50,000, 100,000, 200,000, sparsity $m/p=0.001, 0.01$, and reference panel size $n_w$ varies from 1,000 to 200,000. Vertical lines in blue, orange, and red mark $n_w~=~$50,000, 100,000, 200,000; their intersections with the same-colored dashed curves give the prediction accuracy of $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ when $n_w=n_x$. Decimal numbers on the right side, in blue, orange, and red, give the prediction accuracy of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ under $n_x~=~$50,000, 100,000, 200,000, respectively, with the shorter ticks corresponding to $m/p=0.001$ and the longer ticks to $m/p=0.01$. Right: Sparsity $m/p=0.001, 0.005, 0.01, 0.05$; $n_x$ equals $n_w$, and they vary together from 1,000 to 400,000.
  • Figure 3: Illustrating the theoretical out-of-sample $R^2$ and MSE of $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ as $\lambda$ varies. The out-of-sample $R^2$ and MSE are calculated according to Theorem \ref{thm: i.i.d. lasso mse + R2}. Here we set $\bm{\Sigma} = \bm{I}_{p}$, $p~=~$461,488, heritability $h_x^2=h_s^2=0.6$, $n_x~=~$50,000, 100,000, 200,000, $n_w~=~$50,000, 100,000, 200,000, and $m/p~=~$0.005. Entries of $\bm{\beta}_0$ are i.i.d. random variables following the Bernoulli-Gaussian distribution.
  • Figure 4: Comparing the theoretical out-of-sample $R^2$ and MSE of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ as $\alpha$ varies. The theoretical results of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ ('Lasso') and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ ('Ref Lasso') are calculated according to Proposition \ref{prop: i.i.d. non-ref lasso mse + R2} and Theorem \ref{thm: i.i.d. lasso mse + R2}, respectively. Here we set $\bm{\Sigma} = \bm{I}_{p}$, $p~=~$461,488, heritability $h_x^2=h_s^2=0.6$, sparsity $m/p=0.005$, $n_x~=~$50,000, 200,000, and $n_w~=~$50,000, 100,000, 200,000. Entries of $\bm{\beta}_0$ are i.i.d. random variables following the Bernoulli-Gaussian distribution. The relationship between $\alpha$ and $\lambda$ is given in Equation \ref{alpha(lambda)}.
  • Figure 5: Comparing the theoretical out-of-sample $R^2$ of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ under different sparsity levels as $\alpha$ varies. The out-of-sample $R^2$ values of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ ('Lasso') and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ ('Ref Lasso') are calculated according to Proposition \ref{prop: i.i.d. non-ref lasso mse + R2} and Theorem \ref{thm: i.i.d. lasso mse + R2}, respectively. Here we set $\bm{\Sigma} = \bm{I}_{p}$, $p~=~$461,488, heritability $h_x^2=h_s^2=0.6$, $n_x~=~$50,000, 200,000, and $n_w~=~$50,000, 100,000, 200,000. Entries of $\bm{\beta}_0$ are i.i.d. random variables following the Bernoulli-Gaussian distribution. Left: Sparsity $m/p=0.001$. Right: Sparsity $m/p=0.05$.
  • ...and 7 more figures
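
The figure settings above share one simulation recipe: a Bernoulli-Gaussian signal with sparsity $m/p$ and a noise level chosen to hit a target heritability. A minimal sketch of that recipe follows, under assumptions that the captions do not spell out: heritability is taken to be the signal variance over the total variance, which for $\bm{\Sigma} = \bm{I}_p$ is $\|\bm{\beta}_0\|^2 / (\|\bm{\beta}_0\|^2 + \sigma^2)$, and nonzero entries are drawn as standard normal. The helper names are hypothetical, and the paper's Definition 1 may scale things differently.

```python
import numpy as np

def bernoulli_gaussian(p, sparsity, rng):
    """Draw beta0 with i.i.d. entries: nonzero with probability `sparsity`,
    and standard normal when nonzero (assumed scaling)."""
    mask = rng.random(p) < sparsity
    return mask * rng.standard_normal(p)

def noise_sd_for_heritability(beta0, h2):
    """Noise standard deviation that yields heritability h2 when Sigma = I,
    using h2 = signal_var / (signal_var + sigma^2) (assumed definition)."""
    signal_var = float(beta0 @ beta0)
    return float(np.sqrt(signal_var * (1.0 - h2) / h2))

def simulate(n, p, sparsity, h2, seed=0):
    """Simulate (X, y, beta0, sigma) with i.i.d. standard normal predictors."""
    rng = np.random.default_rng(seed)
    beta0 = bernoulli_gaussian(p, sparsity, rng)
    sigma = noise_sd_for_heritability(beta0, h2)
    X = rng.standard_normal((n, p))
    y = X @ beta0 + sigma * rng.standard_normal(n)
    return X, y, beta0, sigma
```

For example, `simulate(n=500, p=1000, sparsity=0.05, h2=0.6)` mimics the right-hand sparsity of Figure 1 at a much smaller scale than the $p~=~$461,488 used in the figures.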

Theorems & Definitions (59)

  • Definition 1: Heritability
  • Remark 1
  • Theorem 3.1
  • Proposition 1: Theorem 1.5 in bayati2011lasso
  • Remark 2: Remark on Condition \ref{cond: comparison techinical}
  • Proposition 2
  • Theorem 4.1
  • Proposition 3
  • Proposition 4
  • Theorem 4.2
  • ...and 49 more