Two-Stage Multiple Test Procedures Controlling False Discovery Rate with auxiliary variable and their Application to Set4Delta Mutant Data

Seohwa Hwang; Mark Louie Ramos; DoHwan Park; Junyong Park; Johan Lim; Erin Green

TL;DR

This paper addresses improving false discovery rate (FDR) control in multiple testing by leveraging auxiliary information through a copula-based joint model of the primary statistic and an auxiliary variable. It introduces two-stage FDR procedures, Two-Stage FDR(H) and Two-Stage FDR(S), which use hard or soft thresholds on the auxiliary variable to refine testing of the primary variable while maintaining FDR at a pre-specified level. Through simulations and a Set4$\Delta$ yeast dataset, the methods demonstrate higher power than traditional one-stage approaches and many covariate-assisted methods, with robust FDR control even when the copula is misspecified. The work provides practical benefits for gene discovery under stress conditions and offers data and code for reproducibility and broader application to problems with a primary and auxiliary variable.

Abstract

In this paper, we present novel methodologies that incorporate auxiliary variables into multiple hypothesis testing on the primary variable of interest while controlling the false discovery rate (FDR). When conducting multiple tests on the primary variable, researchers can use auxiliary variables to set preconditions for the significance of the primary variables, thereby enhancing test efficacy. Depending on the auxiliary variable's role, we propose two approaches: one terminates testing of the primary variable if the auxiliary variable does not meet predefined conditions, and the other adjusts the evaluation criteria based on the auxiliary variable. Employing the copula method, we model the dependence between the auxiliary and primary variables by deriving their joint distribution from the individual marginal distributions. Our numerical studies, which include comparisons with existing methods, demonstrate that the proposed methodologies effectively control the FDR and yield greater statistical power than previous approaches based solely on the primary variable. As an illustrative example, we apply our methods to the Set4$\Delta$ mutant dataset. Our findings highlight the distinctions between our methodologies and traditional approaches, emphasising the potential advantage of introducing the auxiliary variable: more genes are selected.
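To make the contrast between a one-stage and a hard-threshold two-stage analysis concrete, the sketch below implements the standard Benjamini-Hochberg step-up rule and a simplified screen-then-test variant. It is an illustration only, not the paper's Two-Stage FDR(H): the function names, the use of auxiliary $p$-values `p_aux`, and the plain "screen on `gamma1`, then apply Benjamini-Hochberg to the survivors" rule are assumptions made here, and this naive screen does not use the copula-based adjustment the paper relies on for FDR control.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    if m == 0:
        return reject
    order = np.argsort(p)
    # Compare the i-th smallest p-value with alpha * i / m.
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank meeting the bound
        reject[order[: k + 1]] = True
    return reject


def screen_then_bh(p_primary, p_aux, gamma1, alpha=0.05):
    """Hypothetical hard screen: test only hypotheses with p_aux <= gamma1.

    Assumption for illustration; it omits the paper's copula-based step.
    """
    p_primary = np.asarray(p_primary, dtype=float)
    keep = np.asarray(p_aux, dtype=float) <= gamma1
    reject = np.zeros(p_primary.size, dtype=bool)
    reject[keep] = benjamini_hochberg(p_primary[keep], alpha)
    return reject


# Toy inputs, not taken from the Set4-Delta data.
p_primary = [0.001, 0.02, 0.03, 0.5]
p_aux = [0.01, 0.9, 0.02, 0.03]
one_stage = benjamini_hochberg(p_primary)
two_stage = screen_then_bh(p_primary, p_aux, gamma1=0.1)
```

In this toy input the second hypothesis fails the auxiliary screen and is never tested, which mirrors the "terminate testing" role of the auxiliary variable described above. Whether screening gains or loses rejections depends on how informative the auxiliary variable is.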

Paper Structure (12 sections, 3 theorems, 24 equations, 5 figures, 4 tables, 3 algorithms)

Introduction
Motivating Data set: Set4$\Delta$ mutant data
Models and Estimation of Distributions
Calculation of Marginal $p$-values
Two-stage FDR(H) and Two-stage FDR(S)
False discovery rate controlling procedures
Simulation Studies
Real Data Analysis
Estimation of the Marginal $p$-values and Optimal Copula
Application of FDR Controlling Procedure
Concluding Remarks
Data and Code Availability

Key Result

Lemma 1

From the definition of $p_i^H(\gamma_1)$ in eqn:p_def1, we have for $\gamma=C(\gamma_1,\gamma_2)$ in eqn:gamma_def2 where $\gamma_1$ and $\gamma_2$ are in $[0,1]$.
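The quantity $\gamma=C(\gamma_1,\gamma_2)$ combines the two thresholds through a copula $C$. As a minimal sketch of what such an evaluation looks like, the code below uses the Clayton copula, $C_\theta(u,v)=(u^{-\theta}+v^{-\theta}-1)^{-1/\theta}$ with $\theta>0$. The choice of the Clayton family, the parameter value, and the function name are assumptions for illustration; the copula the paper actually selects is not stated in this summary.

```python
def clayton_copula(u, v, theta):
    """Clayton copula C(u, v) for u, v in [0, 1] and theta > 0.

    Illustrative choice of family; not necessarily the paper's copula.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        raise ValueError("u and v must lie in [0, 1]")
    if u == 0.0 or v == 0.0:
        return 0.0  # boundary condition shared by every copula
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


# Hypothetical thresholds gamma_1 = 0.9 and gamma_2 = 0.05 with theta = 2.
gamma = clayton_copula(0.9, 0.05, theta=2.0)
```

Any copula satisfies $C(u,1)=u$ and $C(u,v)\le\min(u,v)$, so the combined level $\gamma$ never exceeds either individual threshold.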

Figures (5)

Figure 1: Histograms of (A) the logfold changes and (B) their estimated standard deviations.
Figure 2: Comparison of $p$-value distributions: (A) $p$-values generated symmetrically around 0.5, and (B) $p$-values generated from Set4$\Delta$ data
Figure 3: $p_i$'s for the two-stage procedures: (A) and (B) for two-stage FDR(H), and (C) for two-stage FDR(S). Panel (A) shows $p_i^H$ with threshold $\gamma_1=0.9$, and panel (B) shows $p_i^H$ with threshold $\gamma_1=0.7$.
Figure 4: Selection of $\gamma$ in two-stage FDR(H). The x-axis displays $\gamma$ and the y-axis shows the corresponding number of rejections under two-stage FDR(H).
Figure 5: Comparison of Rejection Regions: One-Stage vs. Two-Stage FDR Methods. The x-axis represents the standard deviation of the logfold changes on a logarithmic scale (the auxiliary variable), while the y-axis shows the logfold changes (the primary variable). The sky-blue points correspond to the rejection region at $\alpha < 0.05$, and the darker blue points represent the rejection region at $\alpha < 0.10$. In the legend, ✕ and ✴ mark HXT gene family members rejected at $\alpha = 0.05$ and $\alpha = 0.10$, respectively, while ▲ marks non-rejected genes in the HXT gene family.

Theorems & Definitions (7)

Remark 1
Lemma 1
Proof
Theorem 1
Proof
Theorem 2
Proof