Constructing Genetic Risk Scores: Robust Bayesian Approach through Projected Summary Statistics and Flexible Shrinkage

Yuzheng Dun; Nilanjan Chatterjee; Jin Jin; Akihiko Nishimura

TL;DR

This article identifies a potential risk, under a common Bayesian PRS framework, of posterior impropriety when the required GWAS summary statistics and linkage disequilibrium data come from distinct sources. It proposes a projection of the summary statistics that ensures compatibility between the two sources and, in turn, proper behavior of the posterior.

Abstract

Polygenic risk scores (PRS) developed from genome-wide association studies (GWAS) can be used for risk stratification by quantifying the genetic contribution to disease, and many clinical applications have been proposed. Bayesian methods are popular for building PRS because of their natural ability to regularize models and incorporate external information. In this article, we present new theoretical results, methods, and extensive numerical studies to advance Bayesian methods for PRS applications. We identify a potential risk, under a common Bayesian PRS framework, of posterior impropriety when integrating the required GWAS summary statistics and linkage disequilibrium (LD) data from distinct sources. As a principled remedy, we propose a projection of the summary statistics that ensures compatibility between the two sources and in turn a proper behavior of the posterior. We further introduce a new PRS method, with accompanying software, under the less-explored Bayesian bridge prior to more flexibly model varying sparsity levels in effect-size distributions. We extensively benchmark it against alternative Bayesian methods using synthetic and real datasets, quantifying the impact of prior specification and LD estimation strategy. Our proposed PRS-Bridge, equipped with the projection technique and flexible prior, demonstrates the most consistent and generally superior performance across a variety of scenarios.
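To make the projection idea concrete, here is a minimal sketch, not the authors' implementation. It assumes that "projection" means an orthogonal projection of the summary-statistic vector onto the column space of the reference LD matrix, which the abstract does not spell out. The function name `project_summary_stats`, the tolerance `rel_tol`, and the toy data are all hypothetical and chosen for illustration only.

```python
import numpy as np

def project_summary_stats(beta_sum, ld_ref, rel_tol=1e-8):
    """Project beta_sum onto the column space of ld_ref.

    Assumption (not from the source): the projection is orthogonal and the
    column space is taken from eigenvectors whose eigenvalues exceed
    rel_tol times the largest eigenvalue.
    """
    ld_ref = 0.5 * (ld_ref + ld_ref.T)           # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(ld_ref)    # symmetric eigendecomposition
    keep = eigvals > rel_tol * eigvals.max()     # drop numerically null directions
    basis = eigvecs[:, keep]
    return basis @ (basis.T @ beta_sum)

# Toy example: an LD matrix estimated from 3 reference individuals and 5 SNPs
# is rank deficient, so a generic beta_sum has a component outside its range.
rng = np.random.default_rng(0)
genotypes = rng.standard_normal((3, 5))
ld_ref = np.corrcoef(genotypes, rowvar=False)    # 5 x 5, rank at most 2
beta_sum = rng.standard_normal(5)
beta_proj = project_summary_stats(beta_sum, ld_ref)

# The removed component lies in the null space of ld_ref.
residual = beta_sum - beta_proj
print(np.abs(ld_ref @ residual).max())
```

Under this reading, the projected vector has no component in directions where the reference LD matrix carries no information, which is one way the two data sources could be made compatible.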

Paper Structure

This paper contains 32 sections, 5 theorems, 36 equations, 16 figures, 3 tables.

Introduction
Bayesian PRS Modeling Based on GWAS Summary Statistics and External LD Reference Data
Real-World Motivation: Developing PRS When Individual-Level Data are Scarce
Likelihood Based on GWAS Summary Statistics and External LD Reference Data
Nominal Posterior's Ill-behavior Caused by Mismatch between GWAS Summary Statistics and LD Reference Data
Real Data Demonstrations of Danger from Data Mismatch
Projected GWAS Summary Statistics
Methods
Benchmark PRS methods: LDpred2, PRS-CS, and Lassosum
PRS-Bridge: Robust, Flexible, and Scalable PRS Method
LD Approximation Strategy
Numerical Studies
Plasmode Synthetic Data from Spike-and-slab Model
Real Data Benchmark on Continuous Traits from UK Biobank
Real Data Benchmark on Binary Traits using Summary Statistics from External Sources
...and 17 more sections

Key Result

Theorem 1

The following result holds for almost every realization of $\bm{\beta}_\mathrm{sum}$ under Assumption continuous. Consider a joint nominal posterior distribution $\bm{\beta},\lambda^2_1,\ldots,\lambda^2_P \mathbin{|} \bm{\beta}_\mathrm{sum}, \bm{D}_{\mathrm{ref}}, \tau$ under a heavy-tailed prior on ...

Figures (16)

Figure 1: Traceplots of the samples, for the first three coefficients to explode, generated by PRS-CS when removing the ad hoc constraint on the prior variance. The summary statistics are taken from UK Biobank and the LD matrix from 1000G. The data mismatch results in impropriety of the joint nominal posterior and in the explosion of the Gibbs sampler. The software breaks down after a while due to numerical errors. The use of projected summary statistics ensures proper posterior inference.
Figure 2: Comparison of the Strawderman-Berger prior used in PRS-CS and the bridge prior under varying $\alpha$ values. In the left plot, the different priors are scaled to have the same amount of probability in the region $[-2, 2]$ to facilitate the comparison.
Figure 3: Comparison of out-of-sample prediction performance of LDpred2, PRS-CS, and PRS-Bridge on the plasmode synthetic datasets. We report the average as well as 1.96 times the standard error of $R^2$ across the 100 replications. The causal SNP proportion is set to 0.01, 0.001, or 0.0005. The effect sizes of causal variants are assumed to be related to allele frequency under a model with no, mild, or strong negative selection. The LDpred2 software uses a banded LD structure with the default LD radius of 3 cM. The PRS-CS and PRS-Bridge implementations use the block-diagonal LD approximation, with PRS-Bridge additionally using low-rank approximation.
Figure 4: Out-of-sample prediction $R^2$ of Lassosum, LDpred2, PRS-CS, and PRS-Bridge on the six continuous traits: BMI, resting heart rate (RHR), high-density lipoprotein cholesterol (HDL), low-density lipoprotein cholesterol (LDL), apolipoprotein A1 (APOEA), and apolipoprotein B (APOEB). We implement each method with two alternative LD reference data sources: 1000G and UK Biobank. "(Banded)" indicates the use of the banded structure in approximating LD, while "(Small-block)" and "(Large-block)" indicate the use of the block structure with the small and large blocks. For Lassosum, LDpred2, and PRS-CS, we only consider the default LD structures in their software. The error bars represent 1.96 times the standard error of $R^2$ across the 100 replications. Since the errors are correlated, overlaps in error bars should not be interpreted as implying the lack of statistically meaningful differences; Figure \ref{fig:UKBiobank_RE} in the supplement Section \ref{sec:relative_performance} shows $R^2$ of each method relative to Lassosum and indicates a clear trend in their relative performances that remains consistent across the replications.
Figure 5: Out-of-sample prediction performance of Lassosum, LDpred2, PRS-CS, and PRS-Bridge on the five binary disease traits: breast cancer (BC), coronary artery disease (CAD), depression, rheumatoid arthritis (RA), and inflammatory bowel disease (IBD). The methods are implemented in the same manner as described in the caption of Figure \ref{fig:UKBiobank}. The error bars should again be interpreted with caution; Figure \ref{fig:disease_RE} in the supplement Section \ref{sec:relative_performance} provides the transformed AUC of each method relative to Lassosum.
...and 11 more figures

Theorems & Definitions (11)

Theorem 1
Definition S1: Heavy-tailed distribution
Lemma S1
proof
Lemma S2
proof
Lemma S3
proof
proof: Proof of Theorem \ref{thm2}
Theorem S1
...and 1 more
