Benchmarking Heritability Estimation Strategies Across 86 Configurations and Their Downstream Effect on Polygenic Risk Score Performance

Muhammad Muneeb, David B. Ascher

Abstract

Objective: SNP heritability estimates vary substantially across estimation strategies, yet the downstream consequences for polygenic risk score (PRS) construction remain poorly characterised. We systematically benchmarked heritability estimation configurations and assessed how their estimates propagate into downstream PRS performance. Methods: We benchmarked 86 heritability-estimation configurations spanning six tool families (GEMMA, GCTA, LDAK, DPR, LDSC, and SumHer) and ten method groups across 10 UK Biobank phenotypes, yielding 844 configuration-level estimates. Each estimate was propagated into the GCTA-SBLUP and LDpred2-lassosum2 PRS frameworks and evaluated across five cross-validation folds using null, PRS-only, and full models. Eleven binary analytical contrasts were tested with Mann-Whitney U tests to identify drivers of heritability variability. Results: Heritability estimates ranged from -0.862 to 2.735 (mean = 0.134, SD = 0.284); 133 of 844 estimates (15.8%) were negative, and these were concentrated in unconstrained estimation regimes. Ten of the eleven analytical contrasts significantly affected heritability magnitude, with algorithm choice and GRM standardisation showing the largest effects. Despite this upstream variability, downstream PRS test performance was only weakly coupled to heritability magnitude: pooled Pearson correlations between $h^2$ and test AUC were r = -0.023 for GCTA-SBLUP and r = +0.014 for LDpred2-lassosum2, neither of which was statistically significant. Conclusion: SNP heritability is best interpreted as a configuration-sensitive modelling parameter rather than a universally stable scalar input. Heritability estimates should always be reported alongside their full estimation specification, and downstream PRS performance is comparatively robust to moderate variation in the heritability input.
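
The two statistical tools named in the abstract, the Mann-Whitney U test for a binary contrast and the pooled Pearson correlation between heritability and test AUC, can be sketched in plain Python. This is a minimal illustration and not the authors' pipeline: the function names (`midranks`, `mann_whitney_u`, `pearson_r`) and every number below are hypothetical, and the p-value uses the normal approximation with a tie correction and no continuity correction, which is an assumption rather than something the paper specifies.

```python
import math


def midranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample a, p-value). Assumes the pooled sample
    is not entirely tied, otherwise the variance below would be zero.
    """
    pooled = list(a) + list(b)
    n1, n2, n = len(a), len(b), len(pooled)
    ranks = midranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t**3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical h^2 estimates for one binary contrast (made-up numbers).
constrained = [0.10, 0.12, 0.15, 0.18, 0.20]
unconstrained = [-0.30, -0.05, 0.08, 0.45, 0.90]
u_stat, p_value = mann_whitney_u(constrained, unconstrained)
print(f"U = {u_stat:.1f}, two-sided p = {p_value:.3f}")

# Hypothetical pooled (h^2, test AUC) pairs (made-up numbers).
h2 = [0.05, 0.12, 0.20, 0.31, 0.44]
auc = [0.61, 0.60, 0.62, 0.61, 0.60]
print(f"Pearson r = {pearson_r(h2, auc):+.3f}")
```

For sample sizes as small as the toy data here, an exact p-value would normally be preferred over the normal approximation; the approximation is used only to keep the sketch dependency-free.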

Paper Structure

This paper contains 31 sections, 5 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Heatmap of mean SNP heritability estimates across all 86 configurations and 10 phenotypes. Each row corresponds to one unique estimation configuration defined by the tool, model variant, clumping and pruning setting, and number of principal components included; each column corresponds to one phenotype. Cell values represent the mean $h^2$ averaged across five cross-validation folds for the training set. Grey cells (—) indicate configurations for which no valid estimate was obtained for that phenotype. This occurred specifically for LDpred2 when fewer than 15 percent of variants overlapped between the GWAS and the genotype data, in settings where all SNPs were used without restricting to HapMap variants. Negative values are concentrated in unconstrained Haseman--Elston regression configurations and reflect finite-sample sampling variability rather than estimator failure. Fold-level standard errors and 95% confidence intervals for all configurations are reported in Supplementary Table S1 (Heritability Estimates).
  • Figure 2: Distribution of SNP heritability estimates across method families. Each box summarises the mean $h^2$ values produced by one method family across all configurations and phenotypes. The central line is the median, box edges are the interquartile range, and whiskers extend to 1.5$\times$IQR; individual points beyond the whiskers are shown. The dashed red line marks $h^2 = 0$; estimates below this line reflect finite-sample sampling variability in unconstrained Haseman--Elston regression variants and are retained as benchmark outputs. Panel B shows the same estimates as a strip plot coloured by phenotype, illustrating which phenotypes drive extreme values within each family.
  • Figure 3: Inter-method family correlation matrix for heritability estimates. Pairwise Pearson correlations between all six method families are shown, computed over mean $h^2$ profiles across the ten phenotypes. Hierarchical clustering with average linkage and Euclidean distance was applied to the correlation values to produce the dendrogram. Cell values report the Pearson $r$; statistically significant pairs ($p < 0.05$) are indicated. Methods within the same family do not show significantly stronger agreement than cross-family pairs (Mann--Whitney $p = 0.459$), indicating that algorithmic class rather than software family membership drives inter-method agreement.
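
The summaries described in the Figure 2 and Figure 3 captions can be sketched as follows: a box-plot summary with the 1.5$\times$IQR whisker rule, and a Pearson correlation matrix over per-phenotype mean $h^2$ profiles followed by average-linkage clustering on Euclidean distances between rows of that matrix. This is an illustrative sketch under stated assumptions, not the authors' code. The function names, the family labels `A`, `B`, `C`, and all numbers are hypothetical; the quartile convention (`method="inclusive"`) and the reading that whiskers stop at the most extreme data point inside the fences are assumptions, since the captions do not specify them.

```python
import math
from statistics import median, quantiles


def box_summary(values):
    """Median, quartiles, whisker ends and outliers under the 1.5*IQR rule."""
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")  # assumed convention
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    return {
        "median": median(data), "q1": q1, "q3": q3,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lo_fence or v > hi_fence],
    }


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(profiles):
    """profiles maps a family label to its mean h^2 per phenotype."""
    names = list(profiles)
    matrix = [[pearson_r(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix


def average_linkage(names, rows):
    """Agglomerative clustering with average linkage and Euclidean distance.

    rows[i] is the feature vector for names[i] (here a correlation-matrix
    row). Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = {i: [i] for i in range(len(names))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for x in range(len(keys)):
            for y in range(x + 1, len(keys)):
                a, b = clusters[keys[x]], clusters[keys[y]]
                # Average linkage: mean distance over all cross-cluster pairs.
                d = sum(math.dist(rows[i], rows[j]) for i in a for j in b)
                d /= len(a) * len(b)
                if best is None or d < best[0]:
                    best = (d, keys[x], keys[y])
        d, ka, kb = best
        merges.append((sorted(names[i] for i in clusters[ka]),
                       sorted(names[i] for i in clusters[kb]), d))
        clusters[ka] = clusters[ka] + clusters.pop(kb)
    return merges


# Hypothetical mean h^2 values for one method family (made-up numbers).
print(box_summary([-0.20, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.95]))

# Hypothetical per-phenotype mean h^2 profiles for three families.
profiles = {
    "A": [0.10, 0.20, 0.30, 0.40],
    "B": [0.12, 0.21, 0.33, 0.41],
    "C": [0.40, 0.10, 0.35, 0.05],
}
names, corr = correlation_matrix(profiles)
for left, right, dist in average_linkage(names, corr):
    print(left, "+", right, f"at distance {dist:.3f}")
```

Clustering on rows of the correlation matrix, rather than on a distance such as $1 - r$, follows the caption's statement that clustering "was applied to the correlation values"; a different distance definition would generally give a different dendrogram.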