On self-training of summary data with genetic applications

Buxin Su, Jiaoyang Huang, Jin Jin, Bingxin Zhao

TL;DR

This work demonstrates that resampling-based self-training using only summary statistics can achieve the same asymptotic predictive accuracy as conventional training with individual-level data in high-dimensional genetic prediction problems. Using random matrix theory, the authors show that this no-cost property holds for ridge-type and marginal-thresholding estimators and extends to ensemble and multi-ancestry settings. The key insight is that matching the first- and second-order moments of the sampling distribution suffices for asymptotic equivalence, and that the dependence between pseudo-training and pseudo-validation data does not induce overfitting. The theory is complemented by simulations and UK Biobank analyses, which indicate practical viability and potential advantages when validation data are scarce. Overall, the framework broadens access to predictive modeling in genetics and other domains where only summary data are publicly available.

Abstract

Prediction model training is often hindered by limited access to individual-level data due to privacy concerns and logistical challenges, particularly in biomedical research. Resampling-based self-training presents a promising approach for building prediction models using only summary-level data. These methods leverage summary statistics to sample pseudo datasets for model training and parameter optimization, allowing for model development without individual-level data. Although increasingly used in precision medicine, the general behaviors of self-training remain unexplored. In this paper, we leverage a random matrix theory framework to establish the statistical properties of self-training algorithms for high-dimensional sparsity-free summary data. We demonstrate that, within a class of linear estimators, resampling-based self-training achieves the same asymptotic predictive accuracy as conventional training methods that require individual-level datasets. These results suggest that self-training with only summary data incurs no additional cost in prediction accuracy, while offering significant practical convenience. Our analysis provides several valuable insights and counterintuitive findings. For example, while pseudo-training and validation datasets are inherently dependent, their interdependence unexpectedly cancels out when calculating prediction accuracy measures, preventing overfitting in self-training algorithms. Furthermore, we extend our analysis to show that the self-training framework maintains this no-cost advantage when combining multiple methods or when jointly training on data from different distributions. We numerically validate our findings through simulations and real data analyses using the UK Biobank. Our study highlights the potential of resampling-based self-training to advance genetic risk prediction and other fields that make summary data publicly available.
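As a rough illustration of the resampling step described in the abstract, the following Python sketch splits a vector of marginal summary statistics into pseudo-training and pseudo-validation parts by matching the first two moments of the subsampling distribution. It is not the paper's algorithm: the function name `split_summary_stats`, the Gaussian noise model, and the use of a known covariance `Sigma` and phenotypic variance `var_y` are assumptions made for this example.

```python
import numpy as np

def split_summary_stats(s, Sigma, n, n_train, var_y=1.0, rng=None):
    """Split summary statistics s (assumed to be X^T y / n) into pseudo parts.

    Assumption: the statistic computed on a random subset of n_train
    individuals is approximately Gaussian with mean s and covariance
    var_y * (n - n_train) / (n * n_train) * Sigma.
    """
    if not 0 < n_train < n:
        raise ValueError("n_train must satisfy 0 < n_train < n")
    rng = np.random.default_rng(rng)
    n_val = n - n_train

    # Draw noise with covariance Sigma via its Cholesky factor.
    chol = np.linalg.cholesky(Sigma)
    noise = chol @ rng.standard_normal(s.shape[0])

    # Pseudo-training statistic: full-sample statistic plus scaled noise.
    scale = np.sqrt(var_y * n_val / (n * n_train))
    s_train = s + scale * noise

    # Pseudo-validation statistic: whatever remains so that the two parts
    # recombine exactly to the full-sample statistic.
    s_val = (n * s - n_train * s_train) / n_val
    return s_train, s_val
```

By construction, `n_train * s_train + n_val * s_val` equals `n * s`, so the two pseudo datasets are dependent through the shared noise draw, which is the kind of dependence the abstract refers to.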

Paper Structure

This paper contains 34 sections, 14 theorems, 161 equations, 4 figures, 6 algorithms.

Key Result

Lemma 3.1

Let $\widehat{\bm{\Sigma}}_{n} = \mathbf{X}^{\mathrm{T}} \mathbf{X}/n$. Under Conditions cond-np-ratio - cond-X, for any $\theta \in \mathbb{R}_{+}$, with probability one, we have $(\widehat{\bm{\Sigma}}_{n} + \theta \mathbf{I}_{p})^{-1} \asymp (\tau_{n}(\theta) \bm{\Sigma} + \theta \mathbf{I}$ ... Here $\tau_{n}(\theta)$ and $\rho_{n}(\theta)\in \mathbb{C}_{+}$ are solutions to the fixed point e...
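The resolvent $(\widehat{\bm{\Sigma}}_{n} + \theta \mathbf{I}_{p})^{-1}$ in this lemma is the matrix a standard ridge estimator applies to the marginal statistics. The Python sketch below shows that computation together with a summary-level $R^2$ proxy that could be used to compare values of $\theta$. The form $(\widehat{\bm{\Sigma}}_{n} + \theta \mathbf{I}_{p})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{y}/n$, the $R^2$ formula, and the names `ridge_from_summary` and `summary_r2` are assumptions for illustration and may differ from the paper's exact definitions.

```python
import numpy as np

def ridge_from_summary(s, Sigma_hat, theta):
    """Ridge-type coefficients from summary data.

    s         : marginal statistics, assumed to be X^T y / n
    Sigma_hat : sample covariance, assumed to be X^T X / n
    theta     : positive ridge penalty
    """
    p = s.shape[0]
    # Solve (Sigma_hat + theta * I) beta = s rather than forming the inverse.
    return np.linalg.solve(Sigma_hat + theta * np.eye(p), s)

def summary_r2(beta, s_val, Sigma_hat, var_y=1.0):
    """Squared-correlation proxy computed from summary statistics only.

    Assumes cov(x^T beta, y) is estimated by beta^T s_val and
    var(x^T beta) by beta^T Sigma_hat beta.
    """
    num = float(beta @ s_val) ** 2
    den = float(beta @ Sigma_hat @ beta) * var_y
    return num / den if den > 0 else 0.0
```

In a self-training loop one would evaluate `summary_r2` on pseudo-validation statistics for each candidate `theta` and keep the best-performing value.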

Figures (4)

  • Figure 1: Numerical comparison of prediction accuracy between resampling-based and individual-level training for the ridge-type estimator across various hyperparameter values, heritability levels, dimensionalities, and sparsity levels. In this numerical analysis, we assume that $\bm{\Sigma}$ is a block-wise diagonal matrix, $\bm{\Sigma} = {\rm diag}\{\bm{\Sigma}_1, \bm{\Sigma}_2, \dots, \bm{\Sigma}_{n_{\rm block}}\}$, where each block $\bm{\Sigma}_i$ follows an AR(1) process with correlation $\rho = 0.9$, as detailed in Equation \ref{eqn:AR1} and Section \ref{sec:numer}. Left: The out-of-sample $R^2$ values, denoted as $R_{\rm sum, R}^2(\theta)$ and $R_{\rm ind, R}^2(\theta)$, are computed using Algorithm \ref{alg:sum} and Algorithm \ref{alg:ind}, respectively. We evaluate $R_{\rm sum, R}^2(\theta)$ and $R_{\rm ind, R}^2(\theta)$ across a range of hyperparameter values, highlighting the best-performing hyperparameters, $\theta_{\rm sum, R}^*$ and $\theta_{\rm ind, R}^*$, with dashed vertical lines. The parameters are set as follows: $h^2 = 0.8$, $n = p = 5000$, $\kappa = 0.1$, and $n_w = 1000$. Right: The out-of-sample $R^2$ for $\widehat{\bm{\beta}}_{\rm R}(\theta_{\rm sum, R}^*)^*$ and $\widehat{\bm{\beta}}_{\rm R}(\theta_{\rm ind, R}^*)$, where $\theta_{\rm sum, R}^*$ and $\theta_{\rm ind, R}^*$ denote the best-performing hyperparameters selected by Algorithm \ref{alg:sum} and Algorithm \ref{alg:ind}, respectively. We compare the prediction accuracy across varying levels of heritability, dimensionality, and sparsity. The parameters are set as follows: $h^2 \in \{2/5, 1/2, 2/3, 4/5\}$, $p \in \{5000, 10000\}$, $n = 5000$, $\kappa \in \{0.05, 0.5, 0.9\}$, $n_{\rm block} = 20$, and $n_w = 1000$.
  • Figure 2: Numerical comparison of prediction accuracy between resampling-based and individual-level training for the marginal thresholding estimator across various hyperparameter values, heritability levels, dimensionalities, and sparsity levels. In this numerical analysis, we assume that $\bm{\Sigma}$ is a block-wise diagonal matrix, $\bm{\Sigma} = {\rm diag}\{\bm{\Sigma}_1, \bm{\Sigma}_2, \dots, \bm{\Sigma}_{n_{\rm block}}\}$, where each block $\bm{\Sigma}_i$ follows an AR(1) process with correlation $\rho = 0.9$, as detailed in Equation \ref{eqn:AR1} and Section \ref{sec:numer}. Left: The out-of-sample $R^2$ values, denoted as $R_{\rm sum, M}^2(\Theta)$ and $R_{\rm ind, M}^2(\Theta)$, are computed using Algorithm \ref{alg:sum} and Algorithm \ref{alg:ind}, respectively. We evaluate $R_{\rm sum, M}^2(\Theta)$ and $R_{\rm ind, M}^2(\Theta)$ across a range of hyperparameter values, highlighting the best-performing hyperparameters, $\Theta_{\rm sum, M}^*$ and $\Theta_{\rm ind, M}^*$, with dashed vertical lines. The parameters are set as follows: $h^2 = 0.8$, $n = p = 5000$, and $\kappa = 0.1$. Right: The out-of-sample $R^2$ for $\widehat{\bm{\beta}}_{\rm M}(\Theta_{\rm sum, M}^*)^*$ and $\widehat{\bm{\beta}}_{\rm M}(\Theta_{\rm ind, M}^*)$, where $\Theta_{\rm sum, M}^*$ and $\Theta_{\rm ind, M}^*$ denote the best-performing hyperparameters selected by Algorithm \ref{alg:sum} and Algorithm \ref{alg:ind}, respectively. We compare the prediction accuracy across varying levels of heritability, dimensionality, and sparsity. The parameters are set as follows: $h^2 \in \{2/5, 1/2, 2/3, 4/5\}$, $p \in \{5000, 10000\}$, $n = 5000$, $\kappa \in \{0.05, 0.5, 0.9\}$, and $n_{\rm block} = 20$.
  • Figure 3: Comparison of out-of-sample $R^2$ across resampling-based self-training methods using DXA imaging data. Each scatter plot compares the out-of-sample $R^2$ of $71$ DXA traits obtained using Algorithm \ref{alg:sum} for a pair of resampling-based self-training methods: (Left) LDpred2-pseudo vs. Lassosum2-pseudo, (Middle) Lassosum2-pseudo vs. Ensemble-pseudo, and (Right) LDpred2-pseudo vs. Ensemble-pseudo. Data points above the diagonal suggest superior performance of the method on the $y$-axis, while points below the diagonal indicate superior performance of the method on the $x$-axis. Results show that LDpred2-pseudo generally outperforms Lassosum2-pseudo, although the two have comparable prediction accuracy for lower-heritability traits. Ensemble learning, which combines multiple methods, generally outperforms individual methods, especially for highly heritable traits.
  • Figure 4: Comparison of out-of-sample $R^2$ between resampling-based self-training and individual-level data training. Each scatter plot compares the out-of-sample $R^2$ of $71$ DXA traits obtained using Algorithm \ref{alg:sum} and Algorithm \ref{alg:ind}. To assess the impact of validation sample size on prediction accuracy, we evaluate different sample sizes for the individual-level validation dataset in Algorithm \ref{alg:ind}: (Left) $n^{(v)} = 100$, (Middle) $n^{(v)} = 500$, and (Right) $n^{(v)} = 1000$. For Algorithm \ref{alg:sum}, the sample size of the pseudo-validation dataset is fixed at $20\%$ of the GWAS sample size. Results show that Algorithm \ref{alg:sum}, using only summary data, achieves prediction accuracy comparable to Algorithm \ref{alg:ind} when $n^{(v)} = 1000$. Moreover, Algorithm \ref{alg:sum} may outperform Algorithm \ref{alg:ind} when the individual-level validation dataset has a limited sample size ($n^{(v)} = 100$ or $500$).
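
The block-diagonal covariance design described in the captions of Figures 1 and 2 can be sketched in Python as follows. The referenced AR(1) equation is not reproduced on this page, so the entry form $\rho^{|i-j|}$ within each block, the equal block sizes, and the helper name `ar1_block_cov` are assumptions.

```python
import numpy as np

def ar1_block_cov(p, n_block, rho=0.9):
    """Block-diagonal covariance with identical AR(1) blocks.

    Assumes each block has entries rho ** |i - j| and that p is an
    exact multiple of n_block.
    """
    if p % n_block != 0:
        raise ValueError("p must be divisible by n_block")
    m = p // n_block
    idx = np.arange(m)
    # Single AR(1) block: correlation decays geometrically with distance.
    block = rho ** np.abs(idx[:, None] - idx[None, :])
    # Place n_block copies of the block along the diagonal.
    return np.kron(np.eye(n_block), block)

# Small example; the captions use p = 5000 or 10000 with n_block = 20.
Sigma = ar1_block_cov(p=40, n_block=4)
```

With $p = 5000$ and $n_{\rm block} = 20$, each block under this assumption would be $250 \times 250$.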

Theorems & Definitions (16)

  • Definition 2.1
  • Lemma 3.1
  • Lemma 3.2
  • Theorem 3.3
  • Corollary 3.4
  • Theorem 3.5
  • Theorem 3.6
  • Theorem 4.1
  • Definition 5.1
  • Theorem 5.2
  • ...and 6 more