The Exact Risks of Reference Panel-based Regularized Estimators

Buxin Su; Qiang Sun; Xiaochen Yang; Bingxin Zhao

TL;DR

Even when the reference panel is as large as the training data, reference panel-based estimators tend to be less accurate than traditional regularized estimators, and this performance gap widens as the amount of training data increases.

Abstract

Reference panel-based estimators have become widely used in genetic prediction of complex traits due to their ability to address data privacy concerns and reduce computational and communication costs. These estimators estimate the covariance matrix of predictors using an external reference panel, instead of relying solely on the original training data. In this paper, we investigate the performance of reference panel-based $L_1$ and $L_2$ regularized estimators within a unified framework based on approximate message passing (AMP). We uncover several key factors that influence the accuracy of reference panel-based estimators, including the sample sizes of the training data and reference panels, the signal-to-noise ratio, the underlying sparsity of the signal, and the covariance matrix among predictors. Our findings reveal that, even when the sample size of the reference panel matches that of the training data, reference panel-based estimators tend to exhibit lower accuracy compared to traditional regularized estimators. Furthermore, we observe that this performance gap widens as the amount of training data increases, highlighting the importance of constructing large-scale reference panels to mitigate this issue. To support our theoretical analysis, we develop a novel non-separable matrix AMP framework capable of handling the complexities introduced by a general covariance matrix and the additional randomness associated with a reference panel. We validate our theoretical results through extensive simulation studies and real data analyses using the UK Biobank database.
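
To make the idea in the abstract concrete, the following minimal sketch contrasts a traditional ridge ($L_2$) estimator with a reference panel-based one that takes its covariance estimate from an external panel `W`. The function names, the $1/n$ normalization, the isotropic Gaussian design, and the simulation sizes are all assumptions chosen for illustration; the paper's exact definitions and scalings may differ.

```python
import numpy as np

def ridge(X, y, lam):
    """Traditional ridge: covariance is estimated from the training data X."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def ref_panel_ridge(W, X, y, lam):
    """Reference panel-based ridge (assumed form): covariance comes from the
    panel W, while the training data enter only through X^T y."""
    n_x, p = X.shape
    n_w = W.shape[0]
    return np.linalg.solve(W.T @ W / n_w + lam * np.eye(p), X.T @ y / n_x)

def simulate(n_x=400, n_w=400, p=200, lam=1.0, n_test=2000, seed=0):
    """Return the out-of-sample R^2 of both estimators on fresh test data."""
    rng = np.random.default_rng(seed)
    beta0 = rng.normal(size=p) / np.sqrt(p)   # dense signal, illustrative only
    X = rng.normal(size=(n_x, p))             # training features, Sigma = I
    W = rng.normal(size=(n_w, p))             # independent reference panel
    y = X @ beta0 + rng.normal(size=n_x)
    Z = rng.normal(size=(n_test, p))          # held-out test features
    y_test = Z @ beta0 + rng.normal(size=n_test)

    def r2(beta_hat):
        # Squared correlation between predictions and test outcomes.
        return np.corrcoef(Z @ beta_hat, y_test)[0, 1] ** 2

    return r2(ridge(X, y, lam)), r2(ref_panel_ridge(W, X, y, lam))

if __name__ == "__main__":
    r2_ridge, r2_ref = simulate()
    print(f"ridge R^2: {r2_ridge:.3f}, reference panel ridge R^2: {r2_ref:.3f}")
```

Passing the training matrix itself as the panel, `ref_panel_ridge(X, X, y, lam)`, recovers `ridge(X, y, lam)` exactly, which is a quick sanity check on the assumed form.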

Paper Structure (84 sections, 51 theorems, 454 equations, 12 figures, 1 table)

Introduction
Paper overview
Notation
Reference panel-based estimators
The model
Estimators and risk measures
TEXT regularized estimators with isotropic features
Asymptotic results
A case study to compare TEXT and TEXT
TEXT regularized estimators with general TEXT
Non-separable matrix AMP
Existence of a unique solution to the fixed point equation
Numerical illustration of the calibration between TEXT and TEXT
Asymptotic results
TEXT regularized estimators
...and 69 more sections

Key Result

Theorem 3.1

Let $\{\bm{\beta}_0, \bm{\epsilon}_x, \bm{\epsilon}_s, \bm{\Sigma}, \bm{X}, \bm{S}, \bm{W}\}$ be a converging sequence of instances with $\mathbb{P}\left(\bm{\beta}_0(p) \neq 0\right) > 0$. Each row of $\bm{X}$, $\bm{S}$, and $\bm{W}$ is i.i.d. Gaussian with mean [...] and the out-of-sample $R^2$ of $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ is [...] Here $z \sim N$ [...]

Figures (12)

Figure 1: Comparing the theoretical out-of-sample $R^2$ of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ as sparsity and heritability vary. The out-of-sample $R^2$ of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ ('Lasso') and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ ('Ref Lasso') is calculated according to Proposition \ref{prop: i.i.d. non-ref lasso mse + R2} and Theorem \ref{thm: i.i.d. lasso mse + R2}, respectively. Here we set $\bm{\Sigma} = \bm{I}_{p}$, $p = 461{,}488$, $n_x \in \{50{,}000, 100{,}000, 200{,}000\}$, and $n_w \in \{50{,}000, 100{,}000, 200{,}000\}$. Entries of $\bm{\beta}_0$ are i.i.d. random variables following the Bernoulli-Gaussian distribution. For each level of sparsity or heritability, we compute $A^2_{\textnormal{LW}}(\lambda)$ and $A^2_{\textnormal{L}}(\lambda)$ for various values of $\lambda$ and present the results with the respective best-performing $\lambda$ in the figure. Left: Heritability $h_x^2 = h_s^2 = 0.6$, and sparsity $m/p$ varies from 0.001 to 0.49. Right: Sparsity $m/p = 0.05$, and heritability $h_x^2 = h_s^2$ varies from 0.01 to 0.99.
Figure 2: Comparing the theoretical out-of-sample $R^2$ of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ as $n_w$ and $n_x$ vary. The out-of-sample $R^2$ of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ ('Lasso') and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ ('Ref Lasso') is calculated according to Proposition \ref{prop: i.i.d. non-ref lasso mse + R2} and Theorem \ref{thm: i.i.d. lasso mse + R2}, respectively. Here we set $\bm{\Sigma} = \bm{I}_{p}$, $p = 461{,}488$, and heritability $h_x^2 = h_s^2 = 0.3$. Entries of $\bm{\beta}_0$ are i.i.d. random variables following the Bernoulli-Gaussian distribution. Left: $n_x \in \{50{,}000, 100{,}000, 200{,}000\}$, sparsity $m/p \in \{0.001, 0.01\}$, and reference panel size $n_w$ varies from $1{,}000$ to $200{,}000$. The blue, orange, and red vertical lines mark $n_w = 50{,}000$, $100{,}000$, and $200{,}000$; their intersections with the same-colored dashed curves give the prediction accuracy of $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ when $n_w = n_x$. The decimal numbers on the right side, in blue, orange, and red, give the prediction accuracy of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ under $n_x = 50{,}000$, $100{,}000$, and $200{,}000$, respectively, with the shorter ticks corresponding to $m/p = 0.001$ and the longer ticks to $m/p = 0.01$. Right: Sparsity $m/p \in \{0.001, 0.005, 0.01, 0.05\}$, and $n_x = n_w$ vary together from $1{,}000$ to $400{,}000$.
Figure 3: Illustrating the theoretical out-of-sample $R^2$ and MSE of $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ as $\lambda$ varies. The out-of-sample $R^2$ and MSE are calculated according to Theorem \ref{thm: i.i.d. lasso mse + R2}. Here we set $\bm{\Sigma} = \bm{I}_{p}$, $p = 461{,}488$, heritability $h_x^2 = h_s^2 = 0.6$, $n_x \in \{50{,}000, 100{,}000, 200{,}000\}$, $n_w \in \{50{,}000, 100{,}000, 200{,}000\}$, and $m/p = 0.005$. Entries of $\bm{\beta}_0$ are i.i.d. random variables following the Bernoulli-Gaussian distribution.
Figure 4: Comparing the theoretical out-of-sample $R^2$ and MSE of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ as $\alpha$ varies. The theoretical results of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ ('Lasso') and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ ('Ref Lasso') are calculated according to Proposition \ref{prop: i.i.d. non-ref lasso mse + R2} and Theorem \ref{thm: i.i.d. lasso mse + R2}, respectively. Here we set $\bm{\Sigma} = \bm{I}_{p}$, $p = 461{,}488$, heritability $h_x^2 = h_s^2 = 0.6$, sparsity $m/p = 0.005$, $n_x \in \{50{,}000, 200{,}000\}$, and $n_w \in \{50{,}000, 100{,}000, 200{,}000\}$. Entries of $\bm{\beta}_0$ are i.i.d. random variables following the Bernoulli-Gaussian distribution. The relationship between $\alpha$ and $\lambda$ is given in Equation \ref{alpha(lambda)}.
Figure 5: Comparing the theoretical out-of-sample $R^2$ of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ under different sparsity levels as $\alpha$ varies. The out-of-sample $R^2$ of $\widehat{\bm{\beta}}_{\textnormal{L}}(\lambda)$ ('Lasso') and $\widehat{\bm{\beta}}_{\textnormal{LW}}(\lambda)$ ('Ref Lasso') is calculated according to Proposition \ref{prop: i.i.d. non-ref lasso mse + R2} and Theorem \ref{thm: i.i.d. lasso mse + R2}, respectively. Here we set $\bm{\Sigma} = \bm{I}_{p}$, $p = 461{,}488$, heritability $h_x^2 = h_s^2 = 0.6$, $n_x \in \{50{,}000, 200{,}000\}$, and $n_w \in \{50{,}000, 100{,}000, 200{,}000\}$. Entries of $\bm{\beta}_0$ are i.i.d. random variables following the Bernoulli-Gaussian distribution. Left: Sparsity $m/p = 0.001$. Right: Sparsity $m/p = 0.05$.
...and 7 more figures
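
The figure captions describe signals whose entries follow a Bernoulli-Gaussian distribution, indexed by a sparsity level $m/p$ and a heritability $h^2$. The sketch below shows one way such a setup could be simulated. It assumes $\bm{\Sigma} = \bm{I}_p$ with unit-variance features, standard normal nonzero effects, and the working definition $h^2 = \|\bm{\beta}_0\|^2 / (\|\bm{\beta}_0\|^2 + \sigma^2)$; the function names and these conventions are illustrative assumptions, not the paper's Definition 1.

```python
import numpy as np

def bernoulli_gaussian(p, sparsity, rng):
    """Each entry is nonzero with probability `sparsity`; nonzero entries
    are drawn from N(0, 1) (assumed effect-size distribution)."""
    mask = rng.random(p) < sparsity
    return mask * rng.normal(size=p)

def noise_sd_for_heritability(beta0, h2):
    """With Sigma = I, solve h2 = s / (s + sigma^2) for sigma,
    where s = ||beta0||^2 is the signal variance."""
    signal_var = float(beta0 @ beta0)
    return np.sqrt(signal_var * (1.0 - h2) / h2)

def simulate_trait(n, beta0, h2, rng):
    """Draw isotropic Gaussian features and a trait with heritability h2."""
    X = rng.normal(size=(n, beta0.size))
    sigma = noise_sd_for_heritability(beta0, h2)
    y = X @ beta0 + sigma * rng.normal(size=n)
    return X, y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Small p for illustration; the figures use p = 461,488.
    beta0 = bernoulli_gaussian(p=2000, sparsity=0.05, rng=rng)
    X, y = simulate_trait(n=500, beta0=beta0, h2=0.6, rng=rng)
    print("nonzero effects:", np.count_nonzero(beta0), "| var(y):", y.var())
```

The dimensions here are kept small so the dense feature matrix fits in memory; at the scale used in the figures, one would typically work with the theoretical formulas rather than a direct simulation.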

Theorems & Definitions (59)

Definition 1: Heritability
Remark 1
Theorem 3.1
Proposition 1: Theorem 1.5 in bayati2011lasso
Remark 2: Remark on Condition \ref{cond: comparison techinical}
Proposition 2
Theorem 4.1
Proposition 3
Proposition 4
Theorem 4.2
...and 49 more
