Univariate-Guided Sparse Regression for Biobank-Scale High-Dimensional Omics Data

Joshua Richland, Tuomo Kiiskinen, William Wang, Sophia Lu, Balasubramanian Narasimhan, Trevor Hastie, Manuel Rivas, Robert Tibshirani

TL;DR

This work addresses polygenic risk score (PRS) construction in ultra-high-dimensional genomics by adapting Univariate-Guided Sparse Regression (uniLasso) to biobank-scale SNP data. The method is a two-stage procedure: it first computes univariate fits and their leave-one-out predictions, then uses them to bias the multivariate selection toward interpretable, sparse solutions. It achieves predictive performance comparable to Lasso with substantially fewer predictors. An extension that incorporates external summary statistics (uniLasso ES) further improves accuracy while preserving sparsity. Applied to UK Biobank data, uniLasso yields ~40% fewer nonzero SNPs than Lasso, maintains competitive $R^2$/AUC across traits, and, when combined with external scores, delivers the best predictive performance among the methods tested. The results support a sparse polygenic architecture and offer a scalable, interpretable alternative to dense LD-aware methods for PRS construction at population scale.

Abstract

We present a scalable framework for computing polygenic risk scores (PRS) in high-dimensional genomic settings using the recently introduced Univariate-Guided Sparse Regression (uniLasso). UniLasso is a two-stage penalized regression procedure that leverages univariate coefficients and magnitudes to stabilize feature selection and enhance interpretability. Building on its theoretical and empirical advantages, we adapt uniLasso for application to the UK Biobank, a population-based repository comprising over one million genetic variants measured on hundreds of thousands of individuals from the United Kingdom. We further extend the framework to incorporate external summary statistics to increase predictive accuracy. Our results demonstrate that uniLasso attains predictive performance comparable to standard Lasso while selecting substantially fewer variants, yielding sparser and more interpretable models. Moreover, it exhibits superior performance in estimating PRS relative to its competitors, such as PRS-CS. Integrating external scores further improves prediction while maintaining sparsity.
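
The two-stage procedure summarized above can be illustrated with a minimal NumPy sketch. It assumes the structure of uniLasso as we understand it from its original description: stage one fits a separate least-squares regression for each feature and records its leave-one-out (LOO) predictions; stage two fits a Lasso of the response on those LOO predictions with non-negative weights, so that each final coefficient keeps the sign of its univariate slope. The non-negativity constraint, the function names (`univariate_loo`, `nonneg_lasso`, `unilasso_sketch`), and the plain coordinate-descent solver are assumptions made for illustration; this is not the Adelie-based implementation used in the paper and would not scale to biobank-sized data.

```python
import numpy as np

def univariate_loo(X, y):
    """Per-feature OLS slope, intercept, and leave-one-out fitted values.

    Assumes no column of X is constant (otherwise sxx would be zero).
    """
    n = X.shape[0]
    xbar = X.mean(axis=0)
    Xc = X - xbar
    sxx = (Xc ** 2).sum(axis=0)
    slope = Xc.T @ (y - y.mean()) / sxx
    intercept = y.mean() - slope * xbar
    fitted = intercept + X * slope            # n x p in-sample fits
    leverage = 1.0 / n + Xc ** 2 / sxx        # h_ii of each simple regression
    # Standard identity: LOO residual = residual / (1 - h_ii)
    loo = y[:, None] - (y[:, None] - fitted) / (1.0 - leverage)
    return slope, intercept, loo

def nonneg_lasso(F, y, lam, n_iter=200):
    """Coordinate descent for
    min (1/2n)||y - t0 - F t||^2 + lam * sum(t)  subject to  t >= 0."""
    n, p = F.shape
    f_mean, y_mean = F.mean(axis=0), y.mean()
    Fc, yc = F - f_mean, y - y_mean           # centering absorbs the intercept
    col_ss = (Fc ** 2).sum(axis=0) / n
    theta = np.zeros(p)
    resid = yc.copy()                         # equals yc - Fc @ theta throughout
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            rho = Fc[:, j] @ resid / n + col_ss[j] * theta[j]
            new = max(0.0, rho - lam) / col_ss[j]
            resid -= Fc[:, j] * (new - theta[j])
            theta[j] = new
    return y_mean - f_mean @ theta, theta

def unilasso_sketch(X, y, lam):
    """Two-stage fit; returns (intercept, coefficients) on the scale of X."""
    slope, intercept, loo = univariate_loo(X, y)
    theta0, theta = nonneg_lasso(loo, y, lam)
    # Fold the stage-two weights back onto the original features.
    return theta0 + theta @ intercept, theta * slope

# Small synthetic demonstration (simulated data, not UK Biobank).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 50))
y_demo = 1.0 * X_demo[:, 0] - 0.8 * X_demo[:, 1] + rng.normal(size=500)
b0_demo, b_demo = unilasso_sketch(X_demo, y_demo, lam=0.05)
print("nonzero coefficients:", int(np.count_nonzero(b_demo)))
```

Because the stage-two weights are non-negative, a feature whose univariate association is weak or has the opposite sign to its multivariate effect tends to be dropped rather than sign-flipped, which is one plausible reading of why the resulting models are sparser than a standard Lasso fit.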

Paper Structure

This paper contains 14 sections, 8 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Comparison of the test set predictive performance of the different polygenic risk score (PRS) methods with refitting on the training and the validation set. All regressions were performed with Adelie in Python. Test-set $R^2$ is the metric evaluated for the continuous phenotypes (height and BMI), and AUC is evaluated for the binary phenotypes (asthma and CHD).
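
For reference, the two metrics named in the Figure 1 caption can be computed as in the short sketch below. It assumes the common definitions: $R^2$ as one minus the ratio of residual to total sum of squares on the test set, and AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The paper may use a different convention (for example, squared correlation for $R^2$); the function names here are illustrative.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Test-set R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def auc(labels, scores):
    """AUC via pairwise comparison of case and control scores.

    Uses O(n_cases * n_controls) memory, so it is only suitable
    for small illustrative inputs.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)
```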