Simultaneous inference for generalized linear models with unmeasured confounders

Jin-Hong Du, Larry Wasserman, Kathryn Roeder

TL;DR

A unified statistical estimation and inference framework is proposed that harnesses orthogonal structures and integrates linear projections into three key stages. In numerical experiments, it controls the false discovery rate under the Benjamini-Hochberg procedure and is more powerful than alternative methods.

Abstract

Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It begins by disentangling marginal and uncorrelated confounding effects to recover the latent coefficients. Subsequently, latent factors and primary effects are jointly estimated through lasso-type optimization. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish the identification conditions of various effects and non-asymptotic error bounds. We show effective Type-I error control of asymptotic $z$-tests as sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate by the Benjamini-Hochberg procedure and is more powerful than alternative methods. By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting confounding effects when significant covariates are absent from the model.
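
The final testing step described above (asymptotic $z$-tests followed by the Benjamini-Hochberg procedure) can be illustrated with a minimal sketch. This is a generic implementation of the standard BH step-up rule, not the authors' code; the function names `two_sided_p_values` and `benjamini_hochberg` and the example $z$-statistics are assumptions made for illustration.

```python
import numpy as np
from math import erfc, sqrt


def two_sided_p_values(z):
    """Two-sided p-values for z-statistics under a standard normal null."""
    z = np.asarray(z, dtype=float)
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return np.array([erfc(abs(v) / sqrt(2.0)) for v in z])


def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up rule at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()          # largest rank meeting its threshold
        reject[order[: k + 1]] = True           # reject everything up to that rank
    return reject


# Hypothetical z-statistics for four responses (illustrative values only).
z_stats = [0.0, 5.0, -5.0, 0.5]
rejected = benjamini_hochberg(two_sided_p_values(z_stats), alpha=0.05)
```

Here `rejected` flags the second and third responses, whose $|z|$ values are large. The step-up rule rejects every hypothesis ranked at or below the largest rank whose sorted p-value meets its threshold, even if an intermediate p-value misses its own threshold.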

Paper Structure (75 sections, 22 theorems, 202 equations, 18 figures, 5 tables, 3 algorithms)

Key Result

Proposition 1

Suppose there exists a sequence $\{\tau_p\}_{p\in\mathbb{N}}$ that is uniformly lower bounded away from zero such that the following conditions hold: Then as $p$ tends to infinity, it follows that $\bm{B} = \mathcal{P}_{\bm{\Gamma}}^{\perp}\bm{B} + \bm{o}(1)$ and $\|\mathcal{P}_{\bm{\Gamma}}\bm{B}\|_{\mathop{\mathrm{F}}} \lesssim \sqrt{p}\|\bm{B}\|_{1,1}/\tau_p$, where $\|\cdot\|_{1,1}$ is the el
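
The operators $\mathcal{P}_{\bm{\Gamma}}$ and $\mathcal{P}_{\bm{\Gamma}}^{\perp}$ in the statement above can be illustrated numerically. The sketch below assumes that $\bm{\Gamma}$ is a $p\times r$ matrix with full column rank, that $\bm{B}$ has $p$ rows, and that $\mathcal{P}_{\bm{\Gamma}}$ is the orthogonal projection onto the column space of $\bm{\Gamma}$. The dimensions and random inputs are hypothetical. It only demonstrates the decomposition $\bm{B} = \mathcal{P}_{\bm{\Gamma}}\bm{B} + \mathcal{P}_{\bm{\Gamma}}^{\perp}\bm{B}$ and does not check the proposition's conditions.

```python
import numpy as np


def projection_matrices(Gamma):
    """Orthogonal projection onto col(Gamma) and onto its orthogonal complement.

    Assumes Gamma (p x r) has full column rank.
    """
    Q, _ = np.linalg.qr(Gamma)            # Q: p x r orthonormal basis of col(Gamma)
    P = Q @ Q.T                           # projection onto col(Gamma)
    P_perp = np.eye(Gamma.shape[0]) - P   # projection onto the complement
    return P, P_perp


rng = np.random.default_rng(0)
p, r, d = 50, 2, 3                        # hypothetical dimensions
Gamma = rng.normal(size=(p, r))
B = rng.normal(size=(p, d))

P, P_perp = projection_matrices(Gamma)
B_in = P @ B                              # component of B inside col(Gamma)
B_out = P_perp @ B                        # component orthogonal to col(Gamma)
frob_in = np.linalg.norm(B_in, "fro")     # the quantity bounded in the proposition
```

The proposition says that, under its conditions, `frob_in` is controlled, so that $\bm{B}$ is asymptotically recovered from `B_out`. With the dense random `B` used here no such guarantee applies.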

Figures (18)

  • Figure 1: Causal diagrams on the generative models illustrating the relationship between the covariate $\bm{X}$, the latent variable $\bm{Z}$, and the response $\bm{Y}$. (a) $\bm{Z}$ is a hidden mediator when $\bm{X}$ causes $\bm{Z}$. (b) $\bm{Z}$ is a hidden confounder when $\bm{Z}$ causes $\bm{X}$. Note that we do not require knowledge of the relationship between $\bm{X}$ and $\bm{Z}$ for the analysis in this paper.
  • Figure 2: Overview of the simulated data. (a) The first and second rows show the summary of one simulated dataset for bulk cells (Poisson) in \ref{subsec:simu-poisson} and single cells (Negative Binomial) by Splatter in \ref{subsec:simu-splatter}, respectively. The first column shows the overall distribution of the generated counts; the second column shows the dispersion parameters estimated by the method of moments using the mean estimates from GLM with Poisson likelihood. (b) The proportions of zero and non-zero counts in the two datasets, colored in orange and blue, respectively. (c) The estimated dispersion parameter versus the estimated mean for the simulated single-cell dataset.
  • Figure 3: The Type-I errors, false discovery proportions (FDPs), powers, and precision of different methods on the simulated datasets over 100 runs, with varying numbers of samples $n\in\{100,250\}$ and numbers of latent factors $r\in\{2,10\}$. For glm, the maximum values of Type-I errors and FDPs are clipped at 0.1 and 0.5, respectively. The blue dashed lines indicate the desired cutoffs.
  • Figure 4: False discovery proportion at different $\alpha$ levels for $p$-values adjusted by the Benjamini-Hochberg procedure on 100 simulated datasets when $n=250$. The left and right panels show the results for different numbers of latent factors, (a) $r=2$ and (b) $r=10$, respectively. When $r=10$, the FDP of glm-naive is above 0.15; hence it is not shown in the figure.
  • Figure 5: Simulation results on 100 simulated scRNA-seq datasets generated by Splatter with varying numbers of samples $n\in\{100,200\}$. The four metrics are shown in four columns respectively. The blue dashed lines indicate the desired cutoffs for the statistical errors.
  • ...and 13 more figures
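
The evaluation metrics named in the captions above (Type-I error, false discovery proportion, power, and precision) can be computed from a rejection mask and the ground truth of a simulation. The sketch below uses the usual textbook definitions, which are assumed here and may differ in detail from the paper's. The function name `selection_metrics` and the toy inputs are hypothetical.

```python
import numpy as np


def selection_metrics(reject, is_nonnull):
    """Empirical Type-I error, FDP, power, and precision for one simulated dataset.

    reject:     boolean mask of rejected hypotheses
    is_nonnull: boolean mask of truly non-null hypotheses (known in simulations)
    """
    reject = np.asarray(reject, dtype=bool)
    is_nonnull = np.asarray(is_nonnull, dtype=bool)
    false_disc = np.sum(reject & ~is_nonnull)
    true_disc = np.sum(reject & is_nonnull)
    n_reject = max(int(reject.sum()), 1)          # avoid 0/0 when nothing is rejected
    n_null = max(int((~is_nonnull).sum()), 1)
    n_nonnull = max(int(is_nonnull.sum()), 1)
    fdp = false_disc / n_reject
    # Precision is reported as 0 (not 1 - FDP) when nothing is rejected.
    return {
        "type1_error": false_disc / n_null,       # share of true nulls rejected
        "fdp": fdp,                               # share of rejections that are false
        "power": true_disc / n_nonnull,           # share of non-nulls detected
        "precision": true_disc / n_reject,
    }


# Toy example: four hypotheses, two truly non-null, two rejected.
metrics = selection_metrics(
    reject=[True, True, False, False],
    is_nonnull=[True, False, True, False],
)
```

Averaging these quantities over repeated simulated datasets gives summaries of the kind shown in the figures; the average FDP estimates the false discovery rate.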

Theorems & Definitions (27)

  • Proposition 1: Identification of $\bm{B}$
  • Remark 1: The number of latent factors
  • Theorem 2: Estimation error of $\widehat{\bm{\Theta}}_0$
  • Theorem 3: Estimation error of $\mathcal{P}_{\widehat{\bm{\Gamma}}}$
  • Corollary 4: Estimation of latent components
  • Theorem 5: Estimation error of $\widehat{\bm{B}}$
  • Theorem 6: Asymptotical normality of $\widehat{\bm{B}}^{\textup{de}}$
  • Remark 2: Inference without unmeasured confounders
  • Remark 3: Incorporate information from latent factors
  • Remark 4: Estimation and inference with non-canonical links
  • ...and 17 more