Estimating Reproducibility in Genome-Wide Association Studies

Wei Jiang, Jing-Hao Xue, Weichuan Yu

TL;DR

Two probabilistic measures, the Reproducibility Rate (RR) and the False Irreproducibility Rate (FIR), are proposed to quantitatively describe how primary positive associations behave in the replication study of a genome-wide association study.

Abstract

Genome-wide association studies (GWAS) are widely used to discover genetic variants associated with diseases. To control false positives, all findings from GWAS need to be verified with additional evidence, even associations discovered in a high-power study. A replication study, which uses independent samples, is a common verification method. An association is regarded as a true positive with high confidence when it is identified in both the primary study and the replication study. Currently, there is no systematic study of the behavior of positives in the replication study when the positive results of the primary study are taken as prior information. In this paper, two probabilistic measures, the Reproducibility Rate (RR) and the False Irreproducibility Rate (FIR), are proposed to quantitatively describe the behavior of primary positive associations (i.e., positive associations identified in the primary study) in the replication study. RR is a conditional probability measuring how likely a primary positive association is to be positive in the replication study as well. It can be used to guide the design of the replication study and to check the consistency between the results of the primary study and those of the replication study. FIR, in contrast, measures how likely a primary positive association is to be a true positive even when it is negative in the replication study. It can be used to generate a list of potentially true associations among the irreproducible findings for further scrutiny. Estimation methods for both measures are given. Simulation results and real experiments show that our estimation methods have high accuracy and good prediction performance.
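
The verbal definitions of the two measures can be written as conditional probabilities. The following is only a sketch under assumed notation: the symbols $T$ (the association is truly present), $P_1$ (positive in the primary study) and $P_2$ (positive in the replication study) are introduced here for illustration and are not taken from the paper, whose own definitions may condition on the observed primary test statistic rather than on the event $P_1$ alone.

$$
RR = \Pr(P_2 \mid P_1), \qquad FIR = \Pr(T \mid P_1, \overline{P_2}).
$$

Under this notation, the law of total probability and Bayes' rule give

$$
RR = \Pr(T \mid P_1)\,\Pr(P_2 \mid T, P_1) + \Pr(\overline{T} \mid P_1)\,\Pr(P_2 \mid \overline{T}, P_1),
\qquad
FIR = \frac{\Pr(\overline{P_2} \mid T, P_1)\,\Pr(T \mid P_1)}{1 - RR}.
$$

The second identity shows why the two measures are linked: the denominator $\Pr(\overline{P_2} \mid P_1)$ is exactly $1 - RR$.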

Paper Structure

This paper contains 17 sections, 31 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: $\widehat{RR}$ and $\widehat{FIR}$ can estimate $RR$ and $FIR$ accurately. The x-axis is the true values of $RR$ (in (a)) or $FIR$ (in (b)) in the simulation study, and the y-axis is the corresponding estimated values $\widehat{RR}$ (in (a)) or $\widehat{FIR}$ (in (b)). The solid line is $y=x$.
  • Figure 2: $\widehat{RR}$ of an association can predict its reproducibility in the simulation study. (a) $\widehat{RR}$ is used as a score to decide the reproduced/irreproduced status in the replication study, and a PR curve is drawn by varying the threshold. The x-axis is the recall in reproducibility prediction in terms of $\widehat{RR}$, and the y-axis is the corresponding precision. $AUPRC$ is the area under the precision-recall curve. (b) The associations are partitioned into 10 groups according to $\widehat{RR}$. The x-axis is the $\widehat{RR}$ of the group, which is the mid-point of the range of $\widehat{RR}$ within the group. The y-axis is the corresponding $RP$ of the group, which is the proportion of reproduced associations in the group. The solid line is $y=x$.
  • Figure 3: Precision-recall curve of $\widehat{FIR}$ in the simulation study. $\widehat{FIR}$ of an irreproduced finding can serve as a quantitative index of how likely the finding is to be a true association. The x-axis is the recall in false irreproducibility prediction in terms of $\widehat{FIR}$, and the y-axis is the corresponding precision. $AUPRC$ is the area under the precision-recall curve.
  • Figure 4: Reproducibility prediction in T2D data from DIAGRAM. (a) The x-axis is the recall in reproducibility prediction in terms of $\widehat{RR}$, and the y-axis is the corresponding precision. $AUPRC$ is the area under the precision-recall curve. Both the PR curve based on $\widehat{RR}$ (solid line) and the PR curve based on the $p$-value (dashed line) are drawn. According to their $AUPRC$ values, $\widehat{RR}$ predicts reproducibility better than the $p$-value. (b) The associations are partitioned into 5 groups according to $\widehat{RR}$. The x-axis is the $\widehat{RR}$ of the group, which is the mid-point of the range of $\widehat{RR}$ within the group. The y-axis is the corresponding $RP$ of the group, which is the proportion of reproduced associations in the group. The solid line is $y=x$.
  • Figure 5: Reproducibility prediction in LDL Cholesterol data from GLGC. (a) The x-axis is the recall in reproducibility prediction in terms of $\widehat{RR}$, and the y-axis is the corresponding precision. $AUPRC$ is the area under the precision-recall curve. Both the PR curve based on $\widehat{RR}$ (solid line) and the PR curve based on the $p$-value (dashed line) are drawn. According to their $AUPRC$ values, $\widehat{RR}$ predicts reproducibility better than the $p$-value. (b) The associations are partitioned into 5 groups according to $\widehat{RR}$. The x-axis is the $\widehat{RR}$ of the group, which is the mid-point of the range of $\widehat{RR}$ within the group. The y-axis is the corresponding $RP$ of the group, which is the proportion of reproduced associations in the group. The solid line is $y=x$.
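
The two evaluations described in the captions (a precision-recall curve with its $AUPRC$, and the per-group comparison of $\widehat{RR}$ against $RP$) could be computed roughly as follows. This is a minimal sketch, not the authors' code. The function names are hypothetical, the groups are assumed to be equal-sized after sorting by $\widehat{RR}$ (the captions do not say how the partition is made), and $AUPRC$ is approximated by step-wise average precision, which is one common estimator among several.

```python
from typing import List, Sequence, Tuple


def precision_recall_points(
    scores: Sequence[float], labels: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each item, ranked by descending score.

    Ties are broken by input order (a simplification).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = []
    true_pos = 0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            true_pos += 1
        points.append((true_pos / n_pos, true_pos / rank))
    return points


def average_precision(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Step-wise area under the PR curve (assumed stand-in for AUPRC)."""
    area = 0.0
    prev_recall = 0.0
    for recall, precision in precision_recall_points(scores, labels):
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def group_reproduced_proportion(
    rr_hat: Sequence[float], reproduced: Sequence[bool], n_groups: int
) -> List[Tuple[float, float]]:
    """Return (mid-point of the RR-hat range, RP) for each group.

    Assumption: groups are equal-sized slices of the associations
    sorted by RR-hat; RP is the fraction reproduced within the slice.
    """
    if len(rr_hat) != len(reproduced):
        raise ValueError("rr_hat and reproduced must have the same length")
    n = len(rr_hat)
    if not 1 <= n_groups <= n:
        raise ValueError("n_groups must be between 1 and the number of items")
    order = sorted(range(n), key=lambda i: rr_hat[i])
    result = []
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        group_scores = [rr_hat[i] for i in idx]
        midpoint = (min(group_scores) + max(group_scores)) / 2
        rp = sum(1 for i in idx if reproduced[i]) / len(idx)
        result.append((midpoint, rp))
    return result
```

For a well-calibrated $\widehat{RR}$, the pairs returned by `group_reproduced_proportion` should fall close to the line $y=x$, which is what panels (b) of Figures 2, 4 and 5 are checking.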