High-dimensional statistical inference for linkage disequilibrium score regression and its cross-ancestry extensions

Fei Xue, Bingxin Zhao

TL;DR

This work develops a high-dimensional fixed-effect framework for linkage disequilibrium score regression (LDSC) that integrates GWAS summary data with external reference panels while explicitly modeling genome-wide dependence and block-structured LD patterns. It extends LDSC to cross-ancestry settings and derives rigorous asymptotic normality results for both univariate and bivariate estimators under broad, verifiable conditions, providing explicit variance forms. Through simulations and real data (UK Biobank, HAPNEST, and LD-score references), the authors validate genome-wide LDSC performance and reveal limitations when analyses focus on small annotations or sparse genetic architectures. The results offer practical guidance for multi-ancestry genetic studies, including when and how cross-ancestry LD scores should be constructed, and emphasize the continued importance of adequate GWAS sample sizes for reliable heritability and genetic correlation inference.

Abstract

Linkage disequilibrium score regression (LDSC) has emerged as an essential tool for genetic and genomic analyses of complex traits, utilizing high-dimensional data derived from genome-wide association studies (GWAS). LDSC computes the linkage disequilibrium (LD) scores using an external reference panel, and integrates the LD scores with only summary data from the original GWAS. In this paper, we investigate LDSC within a fixed-effect data integration framework, underscoring its ability to merge multi-source GWAS data and reference panels. In particular, we take account of the genome-wide dependence among the high-dimensional GWAS summary statistics, along with the block-diagonal dependence pattern in estimated LD scores. Our analysis uncovers several key factors of both the original GWAS and reference panel datasets that determine the performance of LDSC. We show that it is relatively feasible for LDSC-based estimators to achieve asymptotic normality when applied to genome-wide genetic variants (e.g., in genetic variance and covariance estimation), whereas it becomes considerably challenging when we focus on a much smaller subset of genetic variants (e.g., in partitioned heritability analysis). Moreover, by modeling the disparities in LD patterns across different populations, we unveil that LDSC can be expanded to conduct cross-ancestry analyses using data from distinct global populations (such as European and Asian). We validate our theoretical findings through extensive numerical evaluations using real genetic data from the UK Biobank study.
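As a rough illustration of the univariate idea described in the abstract, the sketch below regresses GWAS chi-square statistics on reference-panel LD scores and rescales the slope into a heritability estimate. It assumes the commonly used LDSC regression form $E[\chi^2_j] \approx 1 + N h^2 \ell_j / M$ and uses plain unweighted least squares; it is not the estimator analyzed in the paper, and the function name `ldsc_univariate` and its arguments are hypothetical.

```python
import numpy as np


def ldsc_univariate(chi2, ld_scores, n_gwas, n_variants):
    """Unweighted LD score regression sketch (hypothetical helper).

    Fits chi2_j = intercept + slope * ld_score_j by ordinary least squares
    and converts the slope to heritability via h2 = slope * M / N, assuming
    E[chi2_j] ~ 1 + N * h2 * ld_score_j / M.
    """
    chi2 = np.asarray(chi2, dtype=float)
    ld_scores = np.asarray(ld_scores, dtype=float)

    # Design matrix with an intercept column and the LD scores.
    design = np.column_stack([np.ones_like(ld_scores), ld_scores])
    coef, *_ = np.linalg.lstsq(design, chi2, rcond=None)
    intercept, slope = coef

    # Rescale the slope by M / N to obtain the heritability estimate.
    h2 = slope * n_variants / n_gwas
    return h2, intercept


if __name__ == "__main__":
    # Noise-free toy input: chi-square statistics that follow the assumed
    # regression line exactly, so the fit recovers the inputs.
    ld = np.array([1.0, 2.0, 5.0, 10.0, 20.0])
    chi2 = 1.0 + 50_000 * 0.3 / 1_000 * ld
    print(ldsc_univariate(chi2, ld, n_gwas=50_000, n_variants=1_000))
```

In practice the LD scores are estimated from a finite reference panel and the summary statistics are dependent across the genome, which is exactly the setting the paper studies; this toy example ignores both sources of variability.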

Paper Structure (13 sections, 8 theorems, 48 equations, 3 figures)

Key Result

Lemma 1

Under Condition con1, we have as $\min(n_{r\alpha},p)\to \infty$, where $\rho_{l_a} = \{\hbox{Var}(\widehat{\bm{\ell}}_{a}^T\widehat{\bm{\ell}}_{a})\}^{1/2}\asymp (p/n_{r\alpha})^{1/2}$. Moreover, we have

Figures (3)

  • Figure 1: Illustration of LDSC estimators and block-diagonal LD patterns. Univariate LDSC can estimate genetic variance (or heritability) using summary statistics from a GWAS of a particular phenotype A (e.g., depression), coupled with LD scores estimated from a reference panel. Bivariate LDSC can estimate the genetic covariance (or correlation) between phenotypes A and B (such as depression and a brain imaging trait), once again utilizing GWAS summary statistics and LD scores. In this paper, we investigate the theoretical properties of these two LDSC estimators and extend the bivariate LDSC to enable cross-ancestry applications.
  • Figure 2: Univariate LDSC estimator across different sample sizes ($n_{\alpha}$), signal sparsity ($m/p$), and heritability level in the UK Biobank data simulation. We report the results of heritability, which is closely related to genetic variance. We set the heritability $h^2_{\alpha}=0.6$ and $0.3$ in the left and right panels, respectively. We simulate the data with $n_{\alpha}=175,000$, $25,000$, or $5,000$. The horizontal line represents the true heritability.
  • Figure 3: Bivariate LDSC estimator across different sample sizes ($n_{\alpha}$ and $n_{\beta}$), signal sparsity ($m/p$), and sample overlaps. We report the genetic correlation, which is closely related to both genetic covariance and variance. We set the heritability $h^2_{\alpha}=h^2_{\beta}=0.6$ and genetic correlation $\varphi_{\alpha\beta}=0.5$ in the top panels and $h^2_{\alpha}=h^2_{\beta}=0.3$ and $\varphi_{\alpha\beta}=0.25$ in the bottom panels. We simulate the data with $n_{\alpha}=n_{\beta}=175,\!000$ or $25,\!000$. In each panel, we consider three cases of sample overlap: 1) no sample overlap ($0\%$), 2) half sample overlap ($50\%$), and 3) all samples overlap ($100\%$). The horizontal line represents the true genetic correlation.
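
For orientation, the relation between the genetic correlation in Figure 3 and the genetic covariance and variances can be written in the standard way. The symbols $\sigma_{\alpha\beta}$, $\sigma^2_{\alpha}$, and $\sigma^2_{\beta}$ are assumed notation for the genetic covariance and the two genetic variances, and may differ from the paper's own symbols:

$$\varphi_{\alpha\beta} = \frac{\sigma_{\alpha\beta}}{\sqrt{\sigma^2_{\alpha}\,\sigma^2_{\beta}}}$$

Assuming both phenotypes are standardized to unit variance, so that each genetic variance equals the corresponding heritability, the top-panel setting ($h^2_{\alpha}=h^2_{\beta}=0.6$, $\varphi_{\alpha\beta}=0.5$) would correspond to a genetic covariance of $0.5 \times 0.6 = 0.3$, and the bottom-panel setting ($h^2_{\alpha}=h^2_{\beta}=0.3$, $\varphi_{\alpha\beta}=0.25$) to $0.25 \times 0.3 = 0.075$.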

Theorems & Definitions (12)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Theorem 2
  • Remark 1
  • Theorem 3
  • Proof sketch of Theorem \ref{thm_clt_var}
  • Theorem 4
  • Theorem 5
  • Remark 2
  • ...and 2 more