Consistency of heritability estimation from summary statistics in high-dimensional linear models

David Azriel, Samuel Davenport, Armin Schwartzman

TL;DR

This work analyzes the consistency of SNP-heritability estimators that operate on summary statistics in high-dimensional polygenic linear models. It decomposes two popular estimators, LDSC regression (fixed intercept) and GWASH, and derives sufficient and necessary conditions for their consistency under Gaussian and non-Gaussian predictors, using weak dependence (WD) and bounded-kurtosis effects (BKE). The authors show that weighting does not affect consistency under proper truncation, while standardization can induce bias when WD/BKE fail; population stratification introduces bias for all estimators, and free-intercept LDSC does not fully correct it. Through theory and simulations, the paper clarifies when these estimators are reliable and provides guidance on practical implementation, including when to avoid these estimators in the presence of stratification. The results offer a principled basis for using summary-statistic–based heritability estimation in large-scale genetic studies and beyond.

Abstract

In Genome-Wide Association Studies (GWAS), heritability is defined as the fraction of variance of an outcome explained by a large number of genetic predictors in a high-dimensional polygenic linear model. This work studies the asymptotic properties of the most common estimator of heritability from summary statistics called linkage disequilibrium score (LDSC) regression, together with a simpler and closely related estimator called GWAS heritability (GWASH). These estimators are analyzed in their basic versions and under various modifications used in practice including weighting and standardization. We show that, with some variations, two conditions which we call weak dependence (WD) and bounded-kurtosis effects (BKE) are sufficient for consistency of both the basic LDSC with fixed intercept and GWASH estimators, for both Gaussian and non-Gaussian predictors. For Gaussian predictors it is shown that these conditions are also necessary for consistency of GWASH (with truncation) and simulations suggest that necessity holds too when the predictors are non-Gaussian. We also show that, with properly truncated weights, weighting does not change the consistency results, but standardization of the predictors and outcome, as done in practice, introduces bias in both LDSC and GWASH if the two essential conditions are violated. Finally, we show that, when population stratification is present, all the estimators considered are biased, and the bias is not remedied by using the LDSC regression estimator with free intercept, as originally suggested by the authors of that estimator.
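To make the idea of estimating heritability from summary statistics concrete, the following is a minimal sketch of a GWASH-style moment estimator. The abstract does not state the formula, so the form used here is an assumption: $\hat{h}^2 = m(s^2 - 1)/(n\hat{\mu}_2)$, where $s^2$ is the mean squared correlation score and $\hat{\mu}_2$ is a bias-corrected mean LD score. The function names `gwash_estimate` and `simulate`, and all simulation settings, are hypothetical.

```python
import numpy as np


def gwash_estimate(X, y):
    """GWASH-style estimate (assumed form): m * (s2 - 1) / (n * mu2)."""
    n, m = X.shape
    # Standardize predictors and outcome, as is commonly done in practice.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    ys = (y - y.mean()) / y.std(ddof=1)
    # Correlation scores: sqrt(n - 1) times the marginal sample correlations.
    u = Xs.T @ ys / np.sqrt(n - 1)
    s2 = np.mean(u ** 2)
    # Sample correlation matrix of the predictors.
    R = Xs.T @ Xs / (n - 1)
    # Mean LD score with an assumed finite-sample correction (m - 1) / (n - 1).
    mu2 = np.sum(R ** 2) / m - (m - 1) / (n - 1)
    return m * (s2 - 1.0) / (n * mu2)


def simulate(n=3000, m=600, h2=0.5, seed=0):
    """Polygenic linear model with independent Gaussian predictors."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, m))
    # Effects with variance h2 / m, so the explained variance is about h2.
    beta = rng.normal(0.0, np.sqrt(h2 / m), size=m)
    y = X @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return gwash_estimate(X, y)


h2_hat = simulate()
print(round(h2_hat, 3))
```

With independent predictors and Gaussian effects, the conditions discussed in the abstract hold trivially, so the estimate should land close to the simulated heritability of 0.5. The sketch computes the full correlation matrix only for simplicity; in practice LD scores would come from a reference panel.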

Paper Structure

This paper contains 58 sections, 26 theorems, 301 equations, 8 figures, 1 table.

Key Result

Theorem 1

${\rm E}_{{\boldsymbol \beta}} (h^2_{\rm \beta} - h^2)^2 \to 0$ as $m\to\infty$ iff conditions BKE and WD$_0$ hold.

Figures (8)

  • Figure A1: Performance of $\hat{h}^2_{{\rm GWASH}}$ under weak dependence (AR(1) correlation; left) and strong dependence (equi-correlation; right), with and without standardization. Under weak dependence the estimates have decreasing SE and minimal bias tending to zero. Under strong dependence the SE remains constant and the estimates are biased in the standardized case. Simulation SE was at most 0.01 for all plots.
  • Figure A2: Performance of $\hat{h}^2_{{\rm GWASH}}$ for $t$-distributed $\beta$ coefficients, including the Gaussian case as a reference. Heavy tails can cause large bias and SE when the degrees of freedom are 2.5 or less. Standardization ameliorates the effect for the SE but not the bias. Simulation SE was at most 0.005 for the standardized data and seems unbounded in some cases for the unstandardized data.
  • Figure A3: The effect of weighting, assessed by comparing $\hat{h}^2_{\rm GWASH}$ and $\hat{h}^2_{\rm LDSC}$ with their weighted versions $\hat{h}^2_{\rm GWASH-W}$ and $\hat{h}^2_{\rm LDSC-W}$ under the non-stationary correlation structure of Section \ref{SS:weighting}. Weighting decreases the SE of the estimators for the standardized data, but has minimal impact for the unstandardized data. The bias is small in both cases. Simulation SE was at most 0.006 for all plots.
  • Figure A4: The severe bias of $\hat{h}^2_{\rm GWASH}$, $\hat{h}^2_{\rm LDSC}$ and $\hat{h}^2_{\rm LDSC-free}$ under population stratification. The theoretical bias predicted by Theorem \ref{thm:free.bias} (yellow) closely matches the observed bias for $\text{var}(f) \geq 0.05$, but breaks down at lower values. As $\text{var}(f)$ increases beyond 0.1, the bias decreases but remains unacceptably high, exceeding 0.3. These results demonstrate that the free intercept does not remove the bias, contrary to the original claim in Bulik-Sullivan et al. (2015). The simulation SE was at most 0.25 for both plots.
  • Figure A5: Performance of $\hat{h}^2_{\rm LDSC}$ under AR(1) correlation representing weak dependence (left) and equi-correlation representing strong dependence (right), with and without standardization. The results can be interpreted as in Figure \ref{fig:weakvsstrong}. Simulation SE was at most 0.01 for all plots.
  • ...and 3 more figures
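
The contrast between the AR(1) and equi-correlation designs used in Figures A1 and A5 can be illustrated numerically. The sketch below computes the mean LD score $\mathrm{tr}(\Sigma^2)/m$ for both correlation structures as $m$ grows. This quantity is used here only as a hypothetical proxy for the strength of dependence; the paper's formal WD condition is not reproduced in this summary, and the helper names and the choice $\rho = 0.5$ are assumptions.

```python
import numpy as np


def ar1_corr(m, rho):
    """AR(1) correlation matrix: entry (j, k) equals rho ** |j - k|."""
    idx = np.arange(m)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def equi_corr(m, rho):
    """Equi-correlation matrix: 1 on the diagonal, rho elsewhere."""
    return (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))


def mean_ld_score(sigma):
    """Average over predictors of the summed squared correlations."""
    return np.sum(sigma ** 2) / sigma.shape[0]


for m in (100, 400, 1600):
    ar1 = mean_ld_score(ar1_corr(m, 0.5))
    equi = mean_ld_score(equi_corr(m, 0.5))
    print(f"m={m:5d}  AR(1): {ar1:.4f}  equi-correlation: {equi:.2f}")
```

Under AR(1) the mean LD score stays bounded, approaching $(1+\rho^2)/(1-\rho^2)$, whereas under equi-correlation it equals $1 + (m-1)\rho^2$ and grows linearly in $m$. This is one way to see why the two designs behave so differently in the simulations.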

Theorems & Definitions (26)

  • Theorem 1
  • Proposition 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Corollary 1
  • Theorem 7
  • Corollary 2
  • ...and 16 more