
Bootstrap inference for linear regression between variables that are never jointly observed: application in in vivo experiments

Polina Arsenteva, Mohamed Amine Benadjaoud, Hervé Cardot

TL;DR

This work tackles estimating a linear regression when the predictor $X$ and outcome $Y$ are never observed jointly, by exploiting a grouping variable $G$ that is observed with both $X$ and $Y$. It develops two identification strategies: a moment-based method using conditional means and an optimal transport method based on Wasserstein distance, both yielding consistent and asymptotically Gaussian estimators under weak assumptions. Because explicit asymptotic variances are intractable except in simple cases, the authors implement a stratified bootstrap to construct confidence intervals, which proves effective in finite samples and often superior to naive interval methods. The methods are validated via simulations resembling in vivo data and applied to mouse radiotherapy data, where bootstrap-based inference uncovers relationships between biomarkers that naive approaches miss, demonstrating practical value for data-fusion settings with small samples.

Abstract

In modern experimental science, there is a common problem of estimating the coefficients of a linear regression in a context where the variables of interest cannot be observed simultaneously. When there is a categorical variable that is observed on all statistical units, we consider two estimators of linear regression that take this additional information into account: an estimator based on moments and an estimator based on optimal transport theory. These estimators are shown to be consistent and asymptotically Gaussian under weak hypotheses. The asymptotic variance has no explicit expression, except in some special cases, for which reason a stratified bootstrap approach is developed to construct confidence intervals for the estimated parameters, whose consistency is also shown. A simulation study evaluating and comparing the finite sample performance of these estimators demonstrates the advantages of the bootstrap approach in several realistic scenarios. An application to in vivo experiments, conducted in the context of studying radio-induced adverse effects in mice, revealed important relationships between the biomarkers of interest that could not be identified with the considered naive approach.
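The moment-based idea and the stratified bootstrap can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a scalar predictor, a model of the form $Y = \beta_0 + \beta_1 X + \epsilon$ with $\mathbb{E}[\epsilon \mid G] = 0$, an unweighted least-squares fit on group means, and a simple percentile interval. The function names `moment_estimate` and `stratified_bootstrap_ci` are hypothetical, and the demo data are synthetic.

```python
import random
from statistics import fmean


def moment_estimate(x_groups, y_groups):
    """Fit group means of Y on group means of X by least squares.

    x_groups[k] and y_groups[k] hold measurements from group k, taken on
    different units, so X and Y are never paired.
    Assumption: unweighted fit; needs at least two distinct X group means.
    Returns (beta0, beta1).
    """
    mx = [fmean(g) for g in x_groups]
    my = [fmean(g) for g in y_groups]
    mx_bar, my_bar = fmean(mx), fmean(my)
    sxx = sum((a - mx_bar) ** 2 for a in mx)
    sxy = sum((a - mx_bar) * (b - my_bar) for a, b in zip(mx, my))
    beta1 = sxy / sxx
    return my_bar - beta1 * mx_bar, beta1


def stratified_bootstrap_ci(x_groups, y_groups, n_boot=2000, level=0.95, seed=0):
    """Percentile interval for the slope, resampling within each group.

    The X and Y samples are resampled independently, with replacement,
    keeping every group at its original size (the stratification).
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_boot):
        xb = [rng.choices(g, k=len(g)) for g in x_groups]
        yb = [rng.choices(g, k=len(g)) for g in y_groups]
        slopes.append(moment_estimate(xb, yb)[1])
    slopes.sort()
    alpha = 1.0 - level
    # Simple order-statistic approximation of the percentile bounds.
    lo = slopes[int((alpha / 2) * (n_boot - 1))]
    hi = slopes[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


if __name__ == "__main__":
    # Synthetic illustration only: 4 groups, 10 units per group and variable.
    rng = random.Random(42)
    centers = [0.0, 1.0, 2.0, 3.0]
    x_groups = [[rng.gauss(c, 0.5) for _ in range(10)] for c in centers]
    y_groups = [
        [1.0 + 2.0 * rng.gauss(c, 0.5) + rng.gauss(0.0, 0.5) for _ in range(10)]
        for c in centers
    ]
    print("estimate (beta0, beta1):", moment_estimate(x_groups, y_groups))
    print("95% bootstrap CI for beta1:", stratified_bootstrap_ci(x_groups, y_groups))
```

Resampling within groups keeps each group at its original size, which matters when the groups are small. The estimator and interval construction studied in the paper may weight groups or build the interval differently.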

Paper Structure (18 sections, 14 theorems, 60 equations, 6 figures, 1 table)

Key Result

Lemma 2.1

If the linear regression model holds and assumption $\mathbf{H}_1$ is fulfilled, $\boldsymbol{\beta}$ is uniquely identified in terms of the conditional first-order moments of $\mathbf{X}$ and $Y$ given $G$. Additionally, the noise variance $\sigma_\epsilon^2$ satisfies ..., where $\sigma_Y^2$ is the variance of $Y$ and $\boldsymbol{\Gamma}_X$ is the covariance matrix of $\mathbf{X}$ with elements $\hbox{Cov}$ ...
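
A hedged sketch of why conditional means can identify the coefficients. The model equation and the content of $\mathbf{H}_1$ are not shown here, so the following assumes a model of the form $Y = \beta_0 + \boldsymbol{\beta}^\top \mathbf{X} + \epsilon$ with $\mathbb{E}[\epsilon \mid G] = 0$ and $\epsilon$ uncorrelated with $\mathbf{X}$. Taking conditional expectations given each of the $K$ groups yields

$$
\mathbb{E}[Y \mid G = k] = \beta_0 + \boldsymbol{\beta}^\top \mathbb{E}[\mathbf{X} \mid G = k], \qquad k = 1, \dots, K,
$$

a linear system in $(\beta_0, \boldsymbol{\beta})$ that involves only quantities estimable from the separate samples of $\mathbf{X}$ and $Y$. It has a unique solution when the conditional means of $\mathbf{X}$ vary enough across groups, which is presumably the role of $\mathbf{H}_1$. Under the same assumptions, the variance of $Y$ decomposes as

$$
\sigma_Y^2 = \boldsymbol{\beta}^\top \boldsymbol{\Gamma}_X \boldsymbol{\beta} + \sigma_\epsilon^2
\quad\Longrightarrow\quad
\sigma_\epsilon^2 = \sigma_Y^2 - \boldsymbol{\beta}^\top \boldsymbol{\Gamma}_X \boldsymbol{\beta}.
$$

The exact statement of the lemma in the paper may differ from this sketch.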

Figures (6)

  • Figure 1: Schematic representation of the design of an in vivo experiment studying the effect of irradiated volume.
  • Figure 2: Distribution of the data, collected from the irradiated patch under SBRT with 3 mm beam size: the expression of the gene IL6 on the left, and septal thickness on the right. The measurements were made 1, 3, 6 and 12 months after irradiation.
  • Figure 3: The effect of different values of $\rho$ on the data, with $K=4$ and $\sigma_{X}^2= 0.75$. a) Boxplots constructed from the simulated values of $X^k_i$. b) Boxplots constructed from the simulated values of $Y^k_i$ with lower relative noise level, i.e. $\rho=1.1$. c) Boxplots constructed from the simulated values of $Y^k_i$ with higher relative noise level, i.e. $\rho=1.01$.
  • Figure 4: a) Coverage rates, b) average amplitudes, and c) powers of the confidence intervals for the estimators of $\beta_1$ obtained from 500 simulations, with number of groups $K=4$ and number of animals per group $n=10$. The columns of the tables indicate simulation scenarios with different combinations of parameters: scenario S1 with lower group overlap ($\sigma^2_{X}=0.75$) and higher signal-to-noise ratio ($\rho=1.1$), S2 with higher group overlap ($\sigma^2_{X}=2$) and higher signal-to-noise ratio ($\rho=1.1$), S3 with lower group overlap ($\sigma^2_{X}=0.75$) and lower signal-to-noise ratio ($\rho=1.01$), and S4 with higher group overlap ($\sigma^2_{X}=2$) and lower signal-to-noise ratio ($\rho=1.01$). The rows indicate the method used to estimate the confidence intervals: "mm (asymp)" stands for the method of moments with asymptotic confidence intervals, "mm (boot)" for the method of moments with bootstrap, "ot (boot)" for the optimal transport method with bootstrap, "mm (student)" for the naive linear regression on means approach based on Student's distribution, and "simultaneous" for the classical linear regression estimation in the case where the predictor and the predicted variable are observed simultaneously.
  • Figure 5: Distributions of amplitudes of confidence intervals obtained with different methods based on 500 simulations under scenarios S2 (a) and S3 (b), with $K=4$ and $n=10$.
  • ...and 1 more figure

Theorems & Definitions (15)

  • Lemma 2.1
  • Lemma 2.2
  • Lemma 4.1
  • Lemma 4.2
  • Proposition 4.1
  • Remark 4.1
  • Proposition 4.2
  • Lemma 5.1
  • Proposition 6.1
  • Proposition 6.2
  • ...and 5 more