A harmonized benchmarking framework for implementation-aware evaluation of 46 polygenic risk score tools across binary and continuous phenotypes

Muhammad Muneeb; David B. Ascher

Abstract

Polygenic risk score (PRS) tools differ substantially in statistical assumptions, input requirements, and implementation complexity, making direct comparison difficult. We developed a harmonized, implementation-aware benchmarking framework to evaluate 46 PRS tools across seven binary UK Biobank phenotypes and one continuous trait under three model configurations: null, PRS-only, and PRS plus covariates. The framework integrates standardized preprocessing, tool-specific execution, hyperparameter exploration, and unified downstream evaluation using five-fold cross-validation on high-performance computing infrastructure. In addition to predictive performance, we assessed runtime, memory use, input dependencies, and failure modes. A Friedman test across 40 phenotype--fold combinations confirmed significant differences in tool rankings ($\chi^2 = 102.29$, $p = 2.57 \times 10^{-11}$), with no single method universally optimal. These findings provide a reproducible framework for comparative PRS evaluation and demonstrate that tool performance is shaped not only by statistical methodology but also by phenotype architecture, preprocessing choices, covariate structure, computational demands, software robustness, and practical implementation constraints.
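As an illustration of the ranking comparison reported in the abstract, the following is a minimal sketch of a Friedman statistic, assuming each block is one phenotype--fold combination and each column is one tool. The function names and the toy scores are hypothetical and are not taken from the paper; ties receive average ranks and no tie correction is applied, which may differ from the authors' procedure.

```python
def average_ranks(scores):
    """Rank one block of scores; rank 1 is the highest score, ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next score equals the current one (a tie).
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def friedman_statistic(blocks):
    """Return (chi-square statistic, degrees of freedom) for n blocks by k tools."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for scores in blocks:
        for tool_index, rank in enumerate(average_ranks(scores)):
            rank_sums[tool_index] += rank
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, k - 1


# Toy data (hypothetical): 4 blocks, 3 tools, higher score is better.
toy_blocks = [
    [0.60, 0.55, 0.50],
    [0.62, 0.58, 0.51],
    [0.59, 0.57, 0.52],
    [0.61, 0.56, 0.53],
]
chi2, dof = friedman_statistic(toy_blocks)
print(chi2, dof)  # 8.0 2
```

The $p$-value would then come from a chi-square distribution with $k - 1$ degrees of freedom; in the benchmark itself, $n = 40$ blocks, and $k$ is the number of tools included in the test, which the abstract does not state.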

Paper Structure (28 sections, 1 equation, 3 figures, 4 tables)

Introduction
Methods
Results
Discussion
Declarations

Figures (3)

Figure 1: Harmonized benchmarking framework used to evaluate 46 PRS tools across eight phenotypes. The framework standardizes installation, data preparation, hyperparameter definition, and tool execution, followed by cross-validation, performance evaluation, hyperparameter sensitivity analysis, and a structured analysis of each tool's results.
Figure 2: Predictive performance versus operational complexity across 46 PRS tools. Each point represents one PRS tool. The x-axis shows a composite operational complexity score computed as a weighted sum of five normalised components: data input requirements ($w = 0.20$), LD modelling burden ($w = 0.15$), log-normalised mean runtime ($w = 0.25$), normalised mean memory consumption ($w = 0.15$), and phenotype-level failure rate ($w = 0.25$), where failure rate is defined as the proportion of 40 phenotype--fold combinations for which no valid result was produced. The y-axis shows predictive performance averaged across all phenotypes for which the tool produced a valid result, measured as AUC for binary phenotypes and $R^2$ for the continuous Height phenotype. Dashed lines denote the median complexity score (0.33) and median performance (0.578), dividing the space into four quadrants. Quadrant background shading reflects the mean performance of tools within each quadrant, with deeper shading indicating higher average performance.
Figure 3: Hierarchical clustering of PRS tools based on cross-phenotype similarity of SNP effect sizes. For each phenotype, tool-specific beta estimates were averaged across the five cross-validation folds and aligned on overlapping SNPs. Pairwise Pearson correlations between tools were then computed and averaged across phenotypes to obtain an overall similarity matrix. The dendrogram was constructed using distance defined as $1 - r$, where $r$ is the average pairwise correlation. Tools that cluster more closely produced more similar SNP effect-size profiles across the evaluated phenotypes.
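The composite operational complexity score on the x-axis of Figure 2 can be sketched as follows. The weights are those given in the caption; the dictionary keys, function names, and example component values are hypothetical, and each component is assumed to be already normalised to $[0, 1]$, since the caption does not specify the normalisation procedure.

```python
# Weights from the Figure 2 caption; the key names are illustrative only.
WEIGHTS = {
    "input_requirements": 0.20,
    "ld_burden": 0.15,
    "runtime": 0.25,       # log-normalised mean runtime
    "memory": 0.15,        # normalised mean memory consumption
    "failure_rate": 0.25,  # share of phenotype-fold combinations without a valid result
}


def failure_rate(n_failed, n_total=40):
    """Proportion of phenotype-fold combinations for which no valid result was produced."""
    return n_failed / n_total


def complexity_score(components):
    """Weighted sum of the five components, each assumed to lie in [0, 1]."""
    if set(components) != set(WEIGHTS):
        raise ValueError("components must provide exactly the five weighted keys")
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalised to [0, 1]")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical tool: moderate inputs, full LD modelling, 4 failed runs out of 40.
example_tool = {
    "input_requirements": 0.5,
    "ld_burden": 1.0,
    "runtime": 0.4,
    "memory": 0.2,
    "failure_rate": failure_rate(4),
}
score = complexity_score(example_tool)
print(round(score, 3))  # 0.405
```

Because the five weights sum to 1, the score stays in $[0, 1]$, which is consistent with the reported median of 0.33.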
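The similarity analysis described for Figure 3 can be sketched in the same spirit: correlate fold-averaged effect sizes between tools within each phenotype, average the correlations across phenotypes, and cluster on the distance $1 - r$. The caption does not state the linkage method, so average linkage is an assumption here, and the tool names and beta values are toy data rather than results from the paper.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equally long lists of SNP effect sizes."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def distance_matrix(betas_by_phenotype, tools):
    """Distance 1 - r, where r is the correlation averaged across phenotypes."""
    dist = {}
    for a, b in combinations(tools, 2):
        rs = [pearson(betas[a], betas[b]) for betas in betas_by_phenotype]
        dist[frozenset((a, b))] = 1.0 - sum(rs) / len(rs)
    return dist


def average_linkage(tools, dist):
    """Agglomerative clustering; returns the merges as (cluster, cluster, distance)."""
    clusters = [(t,) for t in tools]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            # Average linkage (assumed): mean distance over all cross-cluster pairs.
            d = sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))
            if best is None or d < best[2]:
                best = (c1, c2, d)
        clusters = [c for c in clusters if c not in best[:2]] + [best[0] + best[1]]
        merges.append(best)
    return merges


# Toy betas (hypothetical), already averaged over folds and aligned on shared SNPs.
toy_betas = [
    {"A": [0.1, 0.2, 0.3, 0.4], "B": [0.1, 0.2, 0.3, 0.5], "C": [0.4, 0.1, 0.3, 0.0]},
    {"A": [1.0, 2.0, 3.0], "B": [1.0, 2.0, 4.0], "C": [3.0, 1.0, 2.0]},
]
tool_names = ["A", "B", "C"]
merges = average_linkage(tool_names, distance_matrix(toy_betas, tool_names))
print(merges[0][:2])  # (('A',), ('B',))
```

In this toy example, tools A and B have nearly proportional effect sizes and merge first, mirroring how tools with similar SNP effect-size profiles sit close together in the dendrogram.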
