Two-Stage Testing in a high dimensional setting

Marianne A Jonker, Luc van Schijndel, Eric Cator

Abstract

In a high-dimensional regression setting in which the number of variables ($p$) is much larger than the sample size ($n$), the number of possible two-way interactions between the variables is immense. If the number of variables is of the order of one million, as is common in, e.g., genetics, the number of two-way interactions is of the order of one million squared. When searching for two-way interactions, testing all pairs one by one is computationally infeasible, and the multiple testing correction will be severe. In this paper we describe a two-stage testing procedure consisting of a screening stage and an evaluation stage. It is proven that, under some assumptions, the test statistics in the two stages are asymptotically independent. As a result, multiplicity correction in the second stage is needed only for the number of statistical tests that are actually performed in that stage. This increases the power of the testing procedure. Also, since the testing procedure in the first stage is computationally simple, the computational burden is lowered. Simulations have been performed for multiple settings and regression models (generalized linear models and the Cox PH model) to study the performance of the two-stage testing procedure. The results show type I error control and an increase in power compared to the procedure in which the pairs are tested one by one.
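
To make the two stages concrete, the following is a minimal sketch for the linear regression case only, not the authors' implementation. It assumes that the screening stage applies a Wald test to the main effect of each variable in a marginal model (as the caption of Figure 4 suggests), that the evaluation stage tests the interaction for every pair of variables that passed screening, and that the second-stage correction is a Bonferroni correction over the number of pairs actually tested. The function names `wald_pvalue` and `two_stage_test`, the default thresholds, and the normal approximation for the p-values are illustrative assumptions. For scale: with $p = 10^6$ variables, exhaustive testing would involve $p(p-1)/2 \approx 5 \times 10^{11}$ pairs.

```python
import math
from itertools import combinations

import numpy as np


def wald_pvalue(design, y, coef_index):
    """Two-sided Wald p-value (normal approximation) for one OLS coefficient."""
    n, k = design.shape
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - k)  # residual variance estimate
    se = math.sqrt(sigma2 * xtx_inv[coef_index, coef_index])
    z = beta[coef_index] / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def two_stage_test(X, y, first_stage_threshold=0.05, alpha=0.05):
    """Hypothetical two-stage interaction search (illustrative assumptions).

    Returns (hits, n_second_stage_tests), where hits is a list of
    (j, k, p_value) for pairs declared significant in stage 2.
    """
    n, p = X.shape
    ones = np.ones(n)

    # Stage 1 (screening): marginal model y ~ 1 + x_j, one cheap fit per variable.
    keep = [
        j for j in range(p)
        if wald_pvalue(np.column_stack([ones, X[:, j]]), y, 1) < first_stage_threshold
    ]

    # Stage 2 (evaluation): interaction tests only for pairs of screened variables.
    pairs = list(combinations(keep, 2))
    if not pairs:
        return [], 0

    # Assumed Bonferroni correction over the tests actually performed in stage 2,
    # not over all p * (p - 1) / 2 possible pairs.
    cutoff = alpha / len(pairs)
    hits = []
    for j, k in pairs:
        design = np.column_stack([ones, X[:, j], X[:, k], X[:, j] * X[:, k]])
        pval = wald_pvalue(design, y, 3)  # coefficient of the product term
        if pval < cutoff:
            hits.append((j, k, pval))
    return hits, len(pairs)
```

Calling `two_stage_test(X, y)` with an `n × p` NumPy array `X` and a response vector `y` returns the detected pairs together with the number of second-stage tests, which is the quantity that drives the correction in this sketch. Correcting only for the second-stage tests relies on the asymptotic independence of the first- and second-stage test statistics stated in the abstract; the sketch does not verify the assumptions under which that result holds.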

Paper Structure

This paper contains 14 sections, 6 theorems, 58 equations, 6 figures, 1 table.

Key Result

Theorem

Suppose that $Y_1 \mid X_1, \ldots, Y_n \mid X_n$ is a sample from the generalized linear model with a canonical link function $M_{true}$ in (MM0). Let $\hat{\theta}_1$ and $\hat{\theta}_2$ denote the maximum likelihood estimators of $\theta_1^0$ and $\theta_2^0$ in the nested models $M_1$ a

Figures (6)

  • Figure 1: Heat map of the LD pattern for the 3000 markers based on data of 2000 individuals. LD is measured by the squared correlation coefficient.
  • Figure 2: Power as a function of the interaction effect for different first-stage thresholds (FST), with independent markers. First row: linear regression model; second row: Cox PH model. First column: no main effects; second column: both main effects equal to 0.5; third column: both main effects equal to -0.5. (Note the different scales on the x-axes.)
  • Figure 3: Power as a function of the interaction effect for different first-stage thresholds (FST), with correlated markers. First row: linear regression model; second row: Cox PH model. First column: no main effects; second column: both main effects equal to 0.5; third column: both main effects equal to -0.5. (Note the different scales on the x-axes.)
  • Figure 4: Scatter plot of the value of the Wald test statistic for the main effect in the marginal model against the value of the interaction effect in the full model. Left: linear regression model; middle: Cox PH model; right: Poisson regression model.
  • Figure : Power as a function of the interaction effect for different first-stage thresholds (FST) for the Poisson regression model. First row: independent markers; second row: correlated markers. First column: no main effects; second column: both main effects equal to 0.5; third column: both main effects equal to -0.5. (Note the different scales on the x-axes.)
  • ...and 1 more figure

Theorems & Definitions (11)

  • theorem
  • theorem: Pairwise independence in the linear regression model
  • theorem: Mixed pairwise independence of estimated regression coefficients
  • theorem: Pairwise independence in the Cox PH model
  • proof
  • proof
  • lemma 1: Mixed pairwise independence in GLMs
  • proof
  • lemma 2: FWER control
  • proof
  • ...and 1 more