Environment Invariant Linear Least Squares

Jianqing Fan; Cong Fang; Yihong Gu; Tong Zhang

TL;DR

Environment Invariant Linear Least Squares (EILLS) addresses endogeneity and distributional shifts by leveraging invariance of $\mathbb{E}[y^{(e)}|\bm{x}^{(e)}_{S^*}]$ across multiple environments. The method combines a pooled $L_2$ loss with a focused invariance regularizer $\mathsf{J}(\bm{\beta})$ to promote exogeneity of selected variables, yielding provable non-asymptotic $\ell_2$ error bounds in the low-dimensional regime and variable selection consistency in high dimensions under a near-minimal identification condition. A key novelty is showing that two environments suffice for consistent recovery under mild heterogeneity, enabling statistically efficient estimation without strong structural priors. The work also connects to FGMM and invariant learning frameworks, discusses nonlinear extensions, and highlights computational considerations and potential approximations for scalability.

Abstract

This paper considers a multi-environment linear regression model in which data from multiple experimental settings are collected. The joint distribution of the response variable and covariates may vary across different environments, yet the conditional expectations of $y$ given the unknown set of important variables are invariant. Such a statistical model is related to the problem of endogeneity, causal inference, and transfer learning. The motivation behind it is illustrated by how the goals of prediction and attribution are inherent in estimating the true parameter and the important variable set. We construct a novel environment invariant linear least squares (EILLS) objective function, a multi-environment version of linear least-squares regression that leverages the above conditional expectation invariance structure and heterogeneity among different environments to determine the true parameter. Our proposed method is applicable without any additional structural knowledge and can identify the true parameter under a near-minimal identification condition. We establish non-asymptotic $\ell_2$ error bounds on the estimation error for the EILLS estimator in the presence of spurious variables. Moreover, we further show that the $\ell_0$ penalized EILLS estimator can achieve variable selection consistency in high-dimensional regimes. These non-asymptotic results demonstrate the sample efficiency of the EILLS estimator and its capability to circumvent the curse of endogeneity in an algorithmic manner without any prior structural knowledge. To the best of our knowledge, this paper is the first to realize statistically efficient invariance learning in the general linear model.
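To make the structure of the objective concrete, the following is a minimal sketch of an empirical EILLS-style criterion, not the authors' implementation. It assumes the focused invariance regularizer takes the form $\mathsf{J}(\bm{\beta}) = \sum_{j} 1\{\beta_j \neq 0\} \sum_{e} \omega^{(e)} |\mathbb{E}[x_j^{(e)}(y^{(e)} - \bm{\beta}^\top \bm{x}^{(e)})]|^2$ with equal environment weights, which is not spelled out in the text above. The function names `eills_objective` and `simulate_env`, the toy two-environment data (where $x_1$ causes $y$ and $x_2$ is a child of $y$ with an environment-specific coefficient), and the values of $\gamma$ are all hypothetical choices for illustration. The sketch only evaluates the criterion at two candidate vectors; it does not perform the minimization.

```python
import numpy as np

def eills_objective(beta, envs, gamma, weights=None, tol=1e-12):
    """Pooled L2 loss plus gamma times an (assumed) focused invariance penalty.

    envs is a list of (X, y) pairs, one per environment.
    """
    if weights is None:
        weights = np.full(len(envs), 1.0 / len(envs))  # assumed equal weights
    support = np.abs(beta) > tol  # the penalty only looks at selected variables
    loss, penalty = 0.0, 0.0
    for w, (X, y) in zip(weights, envs):
        resid = y - X @ beta
        loss += w * np.mean(resid ** 2)
        # Empirical version of E[x_j * residual] within this environment.
        moments = X.T @ resid / len(y)
        penalty += w * np.sum(moments[support] ** 2)
    return loss + gamma * penalty

def simulate_env(n, s, rng):
    """Toy SCM (hypothetical): x1 -> y -> x2, with x2 = s * y + noise."""
    x1 = rng.normal(size=n)
    y = x1 + rng.normal(size=n)
    x2 = s * y + rng.normal(size=n)
    return np.column_stack([x1, x2]), y

rng = np.random.default_rng(0)
envs = [simulate_env(5000, 0.5, rng), simulate_env(5000, 2.0, rng)]

beta_star = np.array([1.0, 0.0])  # true parameter, S* = {1}

# Pooled least squares on the stacked data picks up the spurious x2.
X_pool = np.vstack([X for X, _ in envs])
y_pool = np.concatenate([y for _, y in envs])
beta_pls, *_ = np.linalg.lstsq(X_pool, y_pool, rcond=None)

for gamma in (0.0, 20.0):
    q_star = eills_objective(beta_star, envs, gamma)
    q_pls = eills_objective(beta_pls, envs, gamma)
    print(f"gamma={gamma:5.1f}  Q(beta*)={q_star:.3f}  Q(pooled LS)={q_pls:.3f}")
```

In this toy setting the pooled least squares solution attains the smaller value when $\gamma = 0$, since it minimizes the pooled loss by construction, while a sufficiently large $\gamma$ reverses the ordering in favor of $\bm{\beta}^*$ because the within-environment moments of the spurious fit do not vanish.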

Paper Structure (55 sections, 24 theorems, 268 equations, 8 figures, 1 algorithm)

Introduction
The Problem under Study
Related Works
New Contributions and Comparison with Predecessors
Setup and Background
Multi-Environment Linear Regression
Notations
An Example: Structural Causal Model with Different Interventions
Methodology
Focused Linear Invariance Regularizer
Our Approach: EILLS
Theory
Pooled Linear Spurious Variables and the Bias of Pooled Least Squares
Local Strong Convexity for Population Loss
Statistical Analysis of the EILLS Estimator in the Low-dimensional Regime
...and 40 more sections

Key Result

Proposition 2.1

We have $\mathsf{R}_{\mathtt{oos}}(\bm{\beta}^*;\mathcal{U}_{\bm{\beta}^*, \sigma^2})=\sigma^2$, and for any $\bm{\beta} \in \mathbb{R}^p$,

Figures (8)

Figure 1: Linear SCMs with different interventions when $p=4$ and $|\mathcal{E}|=3$. Here $z_5=y$, $S^*=\{2\}$, and we omit the dependence on the exogenous variables $u_1,\ldots, u_5$ for clarity. The arrow from node $i$ to node $j$ is marked by $B_{j,i}^{(e)}$ if $B_{j,i}^{(e)} \neq 0$. We can treat $e=1$ as the observational environment and $e=2,3$ as interventional environments. One intervention is performed on the mechanism of variable $x_4$ (shaded gray) in environment $e=2$, and simultaneous interventions on variables $x_2$ and $x_3$ (shaded gray) are applied in environment $e=3$.
Figure 2: (a) An illustration of the two-environment model. The SCMs in the two environments share the same induced graph, which is also plotted in (b). The arrow from node $x$ to node $z$ indicates that $x$ is a direct cause of $z$. (b) An illustration of how EILLS works under the two-environment model. The double-circled nodes represent the pooled linear spurious variables.
Figure 3: The simulation results for the model in Figure 2 (a). (a) depicts how the estimated coefficients of the EILLS estimator vary with the hyper-parameter $\gamma$ in one trial when $n=300$: blue and red solid lines represent the coefficients for variables in $S^*$ and $G$, respectively, and orange dashed lines represent the coefficients for the other variables. The two gray vertical lines mark $\gamma=15$ and $\gamma=3\times 10^3$, respectively. (b) depicts how the average $\ell_2$ errors (based on $500$ replications, shown in log scale) of different estimators (marked with different shapes) change as $n$ grows: 'LS $S$' is the estimator that runs least squares on $\bm{x}_S$ using all the data, and 'PLS' refers to 'LS $[p]$'. (c) depicts the average number of selected variables in $S^*$ ($+$) and $G$ ($\times$) for the EILLS estimator over $500$ replications.
Figure 4: The simulation results for different methods using data generated from the model in Figure 2 (a). (a) depicts how the average $\ell_2$ prediction errors $\|\bar{\bm{\Sigma}}^{1/2}(\widehat{\bm{\beta}} - \bm{\beta}^*)\|_2^2$ (based on $300$ replications) of different invariance methods (marked with different shapes and colors) change as $n$ grows. (b) and (c) visualize the solutions of different methods in 60 replications when $n=100$ and $n=1000$, respectively. The true parameter $\bm{\beta}^*$ and the population pooled least squares solution $\bar{\bm{\beta}}$ are also shown in red for reference.
Figure 5: A geometric illustration of the bias-difference debiasing idea. We consider the same case where $|\mathcal{E}|=p=2$, $x_1$ is the important variable, and $x_2$ is the pooled linear spurious variable. In each subplot, ${\color{myred} \bm{\beta}^*}$ is the true parameter and $\bm{\beta}^{(e)}$ with $e\in \{{\color{mylightblue} 1}, {\color{myblue} 2}\}$ is the population risk minimizer in each environment. Following the discussion in the text, $d_e = \|\bm{\beta}^* - \bm{\beta}^{(e)}\|_2 \asymp \|\bm{b}^{(e)}\|_2$ quantifies the bias of each environment and $\Delta = \|\bm{\beta}^{(1)} - \bm{\beta}^{(2)}\|_2$ represents the bias-difference. The four plots show four cases in which the magnitudes of the bias and the bias-difference vary, leading to different thresholds $\gamma^*$ satisfying $\gamma^* \asymp \left(\frac{ {\color{mylightblue} d_1 } + { \color{myblue} d_2} }{{\color{myorange} \Delta}}\right)^{2}$. Plots (a) and (b) are cases where $\gamma^*$ is of reasonable, constant order: when the bias and the bias-difference are both relatively small, as in plot (a), or both relatively large, as in plot (b), their ratio is of constant order, that is, ${\color{mylightblue} d_1} + {\color{myblue} d_2} \asymp {\color{myorange} \Delta}$, and so is $\gamma^*$. However, when the bias is much larger than the bias-difference, that is, ${\color{mylightblue} d_1} + {\color{myblue} d_2} \gg {\color{myorange} \Delta}$ as in (c), one needs a large $\gamma^*$ to offset the decrease in loss gained from selecting the pooled linear spurious variable $x_2$. (d) presents a case where the variable set $\{1,2\}$ is also LLS-invariant across the two environments because ${\color{mylightblue} \bm{\beta}^{(1)}}$ and ${\color{myblue} \bm{\beta}^{(2)}}$ coincide. In this case, our proposed EILLS approach will fail and converge to the spurious solution ${\color{mylightblue} \bm{\beta}^{(1)}}={\color{myblue} \bm{\beta}^{(2)}}$ instead of recovering ${\color{myred} \bm{\beta}^*}$.
...and 3 more figures

Theorems & Definitions (37)

Proposition 2.1: Properties of $\mathsf{R}_{\mathtt{oos}}$
Definition 2.1: CE-invariant Set
Proposition 2.2
Definition 3.1: LLS-invariant Set
Definition 4.1: Pooled Linear Spurious Variables
Proposition 4.1: Properties of Pooled Least Squares
Remark 4.1: Near Minimal Identification Condition
Theorem 4.2: Strong Convexity with respect to $\bm{\beta}^*$
Remark 4.2: Interpretation of the Quantities $\mathsf{b}_S$, $\bar{\mathsf{d}}_S$
Remark 4.3: Interpretation of the Critical Threshold $\gamma^*$
...and 27 more
