Stabilizing Variable Selection and Regression

Niklas Pfister, Evan G. Williams, Jonas Peters, Ruedi Aebersold, Peter Bühlmann

Abstract

We consider regression in which one predicts a response $Y$ with a set of predictors $X$ across different experiments or environments. This is a common setup in many data-driven scientific fields, and we argue that statistical inference can benefit from an analysis that takes into account the distributional changes across environments. In particular, it is useful to distinguish between stable and unstable predictors, i.e., predictors which have a fixed or a changing functional dependence on the response, respectively. We introduce stabilized regression, which explicitly enforces stability and thus improves generalization performance to previously unseen environments. Our work is motivated by an application in systems biology. Using multiomic data, we demonstrate how hypothesis generation about gene function can benefit from stabilized regression. We believe that a similar line of argument for exploiting heterogeneity in data can be powerful for many other applications as well. We draw a theoretical connection between multi-environment regression and causal models, which allows us to characterize graphically stable versus unstable functional dependence on the response. Formally, we introduce the notion of a stable blanket, which is a subset of the predictors that lies between the direct causal predictors and the Markov blanket. We prove that this set is optimal in the sense that a regression based on these predictors minimizes the mean squared prediction error given that the resulting regression generalizes to unseen new environments.

Paper Structure

This paper contains 32 sections, 4 theorems, 70 equations, 14 figures, 1 algorithm.

Key Result

Proposition 3.3

Assume Setting \ref{setting:causal}. Then for all intervention stable sets $S\subseteq\{1,\dots, d\}$ it holds that $S\in\mathbb{G}_{\mathcal{E}}$.

Figures (14)

  • Figure 1: Illustrative example of three linear regression procedures applied to data generated according to Example \ref{ex:toy_example} with two training and one testing environment. A good fit means that the dots are close to the identity line (given in black). Linear regression based on all predictors (red) leads to biased results on the testing environment, while a linear regression based only on direct causal variables of the response (blue) leads to unbiased estimation but with higher variance in both the testing and training environments. Stabilized regression (green) aims for the best fit which is also unbiased in the unobserved testing environment.
  • Figure 2: Stabilized regression (SR) applied to the Cholesterol Biosynthesis pathway (CB). The data set consists of protein expression levels ($n=315$) measured for $d=3939$ genes, $16$ of which are known to belong to CB (red gene names). We take protein expression levels of one known CB gene (Hmgcs1) as response $Y$. The axes show subsampling-based selection probabilities for two SR-based variable selection procedures: stable genes $\operatorname{SB}_I(Y)$ on the y-axis and non-stable genes $\operatorname{NSB}_I(Y)$ on the x-axis (the precise definitions can be found in Section \ref{sec:SP_vars}). Many significant genes (green area) are canonical CB genes (red label) or part of an adjacent pathway (blue label). Annotated genes with a semi-evident relationship have yellow labels, and those with no clear relation have black labels. The color coding of the nodes (interpolating between red and black) corresponds to the fraction of times the sign of the regression coefficient was negative/positive (red: negative sign, black: positive sign, grey: never selected).
  • Figure 3: Illustration of multi-environment data generation setting. Only some environments are observed, but one would like to be able to make predictions on any further potentially unobserved environment.
  • Figure 4: Graphical illustration of variable selection. The goal is to find predictors $X=(X^1,\dots,X^{9})$ that are functionally related to the response $Y$. Here, variables $I=(I^1,I^2)$ are unobserved intervention variables. The colored areas represent different targets of inference: Markov blanket, stable blanket and parents (causal variables). If the goal is to get as close as possible to the parents, the stable blanket can improve on the Markov blanket if there are sufficiently many informative interventions.
  • Figure 5: Prediction results based on $1000$ repetitions from \ref{dat:lowdim}. SR performs well both in the standard setting in which $\operatorname{MB}(Y)=\operatorname{SB}_I(Y)$ ($542$ repetitions) and in the more difficult setting $\operatorname{MB}(Y)\neq\operatorname{SB}_I(Y)$ ($458$ repetitions). Apart from SR and IV, no other method is expected to generalize to these settings. The difference in performance between SR and IV is a finite sample property and shows that averaging can outperform direct optimization of the optimization problem \ref{eq:constrained_opt}.
  • ...and 9 more figures
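
The contrast described in the caption of Figure 1 between regressing on all predictors and regressing only on the direct causal predictors can be illustrated with a small simulation. The sketch below is not the paper's toy example and does not implement stabilized regression. It assumes a hypothetical linear model $X^1 \to Y \to X^2$ in which the environment shifts only the mean of $X^2$; the function names `simulate`, `fit_ols` and `mse`, the shift values and the noise levels are all illustrative choices.

```python
import numpy as np

def simulate(n, shift, rng):
    """Hypothetical linear SCM X1 -> Y -> X2; the environment only shifts X2."""
    x1 = rng.normal(size=n)
    y = x1 + rng.normal(size=n)                # Y depends on its parent X1 only
    x2 = y + shift + 0.5 * rng.normal(size=n)  # X2 is a child of Y, shifted per environment
    return np.column_stack([x1, x2]), y

def fit_ols(X, y):
    """Ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    """Mean squared prediction error of a fitted OLS model."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((y - A @ coef) ** 2))

rng = np.random.default_rng(0)

# Two training environments (assumed shifts 0 and 1), pooled for fitting.
train = [simulate(2000, s, rng) for s in (0.0, 1.0)]
X_train = np.vstack([X for X, _ in train])
y_train = np.concatenate([y for _, y in train])

# One unseen test environment with a much larger assumed shift.
X_test, y_test = simulate(2000, 5.0, rng)

results = {}
for name, cols in {"all predictors": [0, 1], "parents only": [0]}.items():
    coef = fit_ols(X_train[:, cols], y_train)
    results[name] = (
        mse(coef, X_train[:, cols], y_train),
        mse(coef, X_test[:, cols], y_test),
    )
    print(f"{name}: train MSE = {results[name][0]:.2f}, test MSE = {results[name][1]:.2f}")
```

In this assumed model, the regression on all predictors fits the training environments better but degrades in the shifted test environment, while the regression on the parent alone has a similar error in all environments. In this sketch, $X^2$ is unstable because the environment acts on it directly, so excluding it is what restores generalization.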

Theorems & Definitions (14)

  • Example 2.1: toy model
  • Definition 2.2: generalizable sets
  • Definition 2.3: generalizable and regression optimal sets
  • Definition 3.1: structural causal model
  • Definition 3.2: intervention stable sets
  • Proposition 3.3: intervention stable sets are generalizable
  • Definition 3.4: stable blanket
  • Theorem 3.5: stable blankets are generalizable and regression optimal
  • Lemma 3.6: OLS in linear SCMs
  • Corollary 3.7: OLS under strong interventions
  • ...and 4 more