Searching for local associations while controlling the false discovery rate

Paula Gablenz, Matteo Sesia, Tianshu Sun, Chiara Sabatti

TL;DR

This work addresses heterogeneity in high-dimensional settings by introducing local conditional hypotheses, which allow each explanatory variable to have context-specific associations with an outcome across covariate-defined environments. It extends the model-X knockoff filter to adaptive testing (the adaptive Local Knockoff Filter, aLKF), enabling both fixed and data-driven discovery of local associations under FDR control without sample splitting; a data-cloaking strategy prevents selection bias. The authors demonstrate the method on simulations and real GWAS data, showing improved localization of causal signals and the ability to identify sex- or environment-specific genetic effects, as illustrated by an analysis of waist-hip ratio (WHR) in the UK Biobank. The approach provides a principled, scalable framework for uncovering subgroup-specific mechanisms in heterogeneous data with rigorous error control, holding promise for precision medicine and complex trait genetics.

Abstract

We introduce local conditional hypotheses that express how the relation between explanatory variables and outcomes changes across different contexts, described by covariates. By expanding upon the model-X knockoff filter, we show how to adaptively discover these local associations, all while controlling the false discovery rate. Our enhanced inferences can help explain sample heterogeneity and uncover interactions, making better use of the capabilities offered by modern machine learning models. Specifically, our method can leverage any model to identify data-driven hypotheses pertaining to different contexts. It then rigorously tests these hypotheses without succumbing to selection bias. Importantly, our approach is efficient and does not require sample splitting. We demonstrate the effectiveness of our method through numerical experiments and by studying the genetic architecture of waist-hip ratio across different sexes in the UK Biobank.
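To make the FDR-control step concrete, the sketch below applies the standard knockoff+ selection rule to a vector of test statistics, one per local hypothesis. It is an illustration under stated assumptions, not the authors' implementation: the function names, the (variable, subgroup) labels, and the numeric statistics are all hypothetical, and the sketch assumes the statistics have already been computed from valid knockoffs so that large positive values indicate evidence against the null. The adaptive partitioning and data-cloaking steps of the aLKF are not shown.

```python
import numpy as np


def knockoff_plus_threshold(w, alpha):
    """Return the knockoff+ threshold for statistics w at nominal level alpha.

    The threshold is the smallest t among the nonzero |w| such that
    (1 + #{w <= -t}) / max(1, #{w >= t}) <= alpha, or infinity if none exists.
    """
    w = np.asarray(w, dtype=float)
    candidates = np.unique(np.abs(w[w != 0]))  # np.unique returns sorted values
    for t in candidates:
        fdp_estimate = (1 + np.sum(w <= -t)) / max(1, np.sum(w >= t))
        if fdp_estimate <= alpha:
            return float(t)
    return float("inf")


def select_local_hypotheses(w, labels, alpha):
    """Return the labels whose statistic is at or above the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, alpha)
    return [label for label, stat in zip(labels, w) if stat >= t]


# Hypothetical statistics: one per (variable, subgroup) local hypothesis.
labels = [
    ("X1", "all"), ("X2", "female"), ("X2", "male"), ("X3", "all"),
    ("X4", "female"), ("X4", "male"), ("X5", "all"), ("X6", "all"),
    ("X7", "female"), ("X7", "male"), ("X8", "all"), ("X9", "all"),
]
w = [5.0, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5, 1.2, 1.1, 1.0, 0.9, -0.5]

threshold = knockoff_plus_threshold(w, alpha=0.1)
rejected = select_local_hypotheses(w, labels, alpha=0.1)
print(threshold, rejected)
```

Indexing the statistics by (variable, subgroup) pairs rather than by variable alone is what lets a single threshold yield discoveries that are specific to one subgroup, such as one sex.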

Paper Structure

This paper contains 51 sections, 7 theorems, 43 equations, 17 figures, 8 tables, 5 algorithms.

Key Result

Theorem 1

Algorithm alg:sskf_second_phase applied to a data set $\mathcal{D} = ({[\mathbf{X}, \tilde{\mathbf{X}}],\mathbf{Y},\mathbf{Z}})$, where $\tilde{\mathbf{X}}$ are valid knockoffs for $\mathbf{X}$, and a fixed partition function $\nu$, controls the FDR at level $\alpha$ for the local hypotheses $\mathc

Figures (17)

  • Figure 1: Performance of the adaptive Local Knockoff Filter (aLKF) and benchmark methods on synthetic data. The informativeness of the discoveries is quantified by the homogeneity of the corresponding subgroups (higher is better). The nominal FDR level is 0.1.
  • Figure 2: Performance of the adaptive Local Knockoff Filter (aLKF) and benchmark methods on real genotype data with a simulated outcome, as a function of the signal amplitude. Half of the important genetic variables have global causal effects, while the other half have local effects. Other details are as in Figure \ref{fig:experiment-heterogeneous-1}.
  • Figure 3: Manhattan plot of aLKF test statistics for the analysis of WHR using the UK Biobank GWAS data. Each dot represents a local hypothesis for either the entire sample or a sex-specific subgroup. The x-axis indicates the genomic location of the leading SNP within each rejected region, labeled by chromosome, while the y-axis shows the corresponding test statistic. The dashed red line marks the rejection threshold for FDR control at 10%. Discoveries specific to females are shown in pink, and those specific to males are in blue.
  • Figure A1: Graphical representation of a non-parametric causal model linking the treatment ($X$), the outcome ($Y$), the measured covariates ($Z$), and possibly other unmeasured covariates ($C$). A typical goal in this setting would be to test whether a particular treatment $X_j$ has any causal effect on the outcome. The joint distribution of $X \mid Z, C$ is assumed to be known and may depend only on $Z$, so that $X \perp \!\!\! \perp C \mid Z$.
  • Figure A2: Illustration of the symmetry requirement for measures of importance. Each column corresponds to one variable: there are three columns for the original variables and three columns for their knockoffs. Each row corresponds to one observation. Colors indicate the partition of observations corresponding to the tested local hypotheses (two subgroups for variable 1, three subgroups for variable 2, and one group for variable 3). Original observations are in a more saturated shade, while the corresponding knockoffs are lighter. Panel (b) represents a possible swapping of $X_j$ with its knockoff $\tilde{X}_j$ within subgroups $\ell \in [L_j]$ and indicates how the measures of importance must correspond to those in panel (a), with the corresponding scores swapped.
  • ...and 12 more figures

Theorems & Definitions (14)

  • Theorem 1
  • Theorem 2
  • Proposition A1: From bates2020
  • proof
  • Proposition A2
  • proof
  • Proposition A3
  • proof: Proof of Proposition \ref{prop:sskf_local_scores-simple}
  • proof: Proof of Theorem \ref{thm:coin-flip-fixed}
  • Lemma A1
  • ...and 4 more