Bootstrap inference for linear regression between variables that are never jointly observed: application in in vivo experiments
Polina Arsenteva, Mohamed Amine Benadjaoud, Hervé Cardot
TL;DR
This work addresses the estimation of a linear regression when the predictor $X$ and the outcome $Y$ are never observed jointly, by exploiting a grouping variable $G$ that is observed with both $X$ and $Y$. It develops two estimation strategies: a moment-based method using conditional means and an optimal transport method based on the Wasserstein distance, both yielding consistent and asymptotically Gaussian estimators under weak assumptions. Because the asymptotic variances have explicit expressions only in special cases, the authors use a stratified bootstrap to construct confidence intervals, which performs well in finite samples and often better than naive interval methods. The methods are validated in simulations resembling in vivo data and applied to mouse radiotherapy data, where bootstrap-based inference uncovers relationships between biomarkers that naive approaches miss, demonstrating practical value for data-fusion settings with small samples.
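To make the conditional-mean idea concrete, here is a minimal sketch and not necessarily the authors' exact estimator. It assumes a model $Y = a + bX + \varepsilon$ with $E[\varepsilon \mid G] = 0$, so that $E[Y \mid G = g] = a + b\,E[X \mid G = g]$, and fits a line through the per-group sample means. The function name `moment_estimator` and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np

def moment_estimator(x, gx, y, gy):
    """Fit a line through (group mean of X, group mean of Y).

    Illustrative helper, not taken from the paper. Assumes E[eps | G] = 0
    and at least two groups with distinct conditional means of X.
    """
    groups = np.intersect1d(gx, gy)          # groups present in both samples
    mean_x = np.array([x[gx == g].mean() for g in groups])
    mean_y = np.array([y[gy == g].mean() for g in groups])
    slope, intercept = np.polyfit(mean_x, mean_y, 1)  # highest degree first
    return intercept, slope

# Simulated data: X and Y come from two separate samples that share only G.
rng = np.random.default_rng(0)
group_means = np.array([0.0, 1.0, 2.0, 3.0])   # assumed values of E[X | G = g]
gx = np.repeat(np.arange(4), 200)              # group labels of the X sample
gy = np.repeat(np.arange(4), 200)              # group labels of the Y sample
x = group_means[gx] + rng.normal(size=gx.size)
x_unobserved = group_means[gy] + rng.normal(size=gy.size)  # never seen with Y
y = 1.0 + 2.0 * x_unobserved + rng.normal(scale=0.5, size=gy.size)

intercept, slope = moment_estimator(x, gx, y, gy)
print(intercept, slope)  # should be near the simulated values a = 1, b = 2
```

The sketch weights all groups equally. With unbalanced group sizes a weighted fit might be preferable, which is a design choice this example does not explore.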
Abstract
A common problem in modern experimental science is estimating the coefficients of a linear regression when the variables of interest cannot be observed simultaneously. When a categorical variable is observed on all statistical units, we consider two estimators of the regression coefficients that take this additional information into account: one based on moments and one based on optimal transport theory. Both estimators are shown to be consistent and asymptotically Gaussian under weak hypotheses. Because the asymptotic variance has no explicit expression except in some special cases, a stratified bootstrap approach is developed to construct confidence intervals for the estimated parameters, and the consistency of this bootstrap is also shown. A simulation study that evaluates and compares the finite-sample performance of these estimators demonstrates the advantages of the bootstrap approach in several realistic scenarios. An application to in vivo experiments, conducted to study radio-induced adverse effects in mice, reveals important relationships between the biomarkers of interest that could not be identified with the naive approach considered.
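The stratified bootstrap can be sketched as follows, under stated assumptions: observations are resampled with replacement within each level of the categorical variable, separately for the two samples, and a percentile interval is read off the bootstrap draws. The plug-in estimator here is a simple fit through group means, used as a stand-in. The function names, the percentile interval, and the simulated data are illustrative choices and not necessarily those of the paper.

```python
import numpy as np

def group_mean_fit(x, gx, y, gy):
    """Stand-in estimator: line through per-group means (intercept, slope)."""
    groups = np.intersect1d(gx, gy)
    mean_x = np.array([x[gx == g].mean() for g in groups])
    mean_y = np.array([y[gy == g].mean() for g in groups])
    slope, intercept = np.polyfit(mean_x, mean_y, 1)
    return np.array([intercept, slope])

def resample_within_groups(values, labels, rng):
    """Resample with replacement inside each group, keeping group sizes fixed."""
    out = values.copy()
    for g in np.unique(labels):
        idx = np.flatnonzero(labels == g)
        out[idx] = values[rng.choice(idx, size=idx.size, replace=True)]
    return out

def stratified_bootstrap_ci(x, gx, y, gy, n_boot=1000, level=0.95, seed=0):
    """Percentile confidence intervals for (intercept, slope)."""
    rng = np.random.default_rng(seed)
    draws = np.empty((n_boot, 2))
    for b in range(n_boot):
        xb = resample_within_groups(x, gx, rng)   # X sample, stratified by G
        yb = resample_within_groups(y, gy, rng)   # Y sample, stratified by G
        draws[b] = group_mean_fit(xb, gx, yb, gy)
    alpha = 1.0 - level
    lower, upper = np.quantile(draws, [alpha / 2, 1 - alpha / 2], axis=0)
    return lower, upper

# Small simulated samples (10 units per group), loosely mimicking in vivo sizes.
rng = np.random.default_rng(1)
group_means = np.array([0.0, 1.0, 2.0, 3.0])   # assumed values of E[X | G = g]
gx = np.repeat(np.arange(4), 10)
gy = np.repeat(np.arange(4), 10)
x = group_means[gx] + rng.normal(size=gx.size)
y = 1.0 + 2.0 * (group_means[gy] + rng.normal(size=gy.size)) \
    + rng.normal(scale=0.5, size=gy.size)

lower, upper = stratified_bootstrap_ci(x, gx, y, gy, n_boot=500)
print("intercept CI:", lower[0], upper[0])
print("slope CI:    ", lower[1], upper[1])
```

Resampling within groups keeps every group size fixed across bootstrap replicates, which matters when each group contains only a handful of units.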
