Invariant quantile regression for heterogeneous environments

Bo Fu, Dandan Jiang

Abstract

In this paper, we propose an invariant quantile regression (IQR) framework designed for multi-environment datasets, which captures invariance across different environments. This model is closely related to transfer learning, causal inference, and fair machine learning, and is motivated by scenarios in which the conditional probability of the response given covariates varies, while certain key features remain invariant. This perspective differs notably from previous works that restrict attention to the conditional mean, which is often insufficient in heterogeneous environments, where the resulting estimators can become sensitive to "bad" environments or to changes in the shape of the noise distribution. In contrast, quantile-based invariance naturally accommodates heterogeneity and aligns more closely with structural causal models, in which variables that are invariant across environments at one or multiple quantile levels naturally indicate potential and stable causal predictors. Moreover, the set of endogenous variables under the IQR framework can typically be larger than that under the conditional mean framework, which in turn promotes more effective exclusion of spurious (non-causal) predictors, provided that endogenous variables are not incorporated. To achieve this, we introduce a Kernel-Smoothed Focused Invariance Quantile Regression (KSFIQR) estimator, which leverages the underlying invariance structure and the heterogeneity among environments to ensure stable estimation across multiple environments. We establish the causal discovery properties of our method, demonstrate its ability to overcome the "curse of endogeneity", and derive an $\ell_2$ error bound for our estimator in the low-dimensional regime, all in a non-asymptotic framework. From an algorithmic perspective, we implement the estimator using the L-BFGS-B method and the Gumbel trick, and our numerical studies validate the proposed approach.
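
To make the "kernel-smoothed" part of the abstract concrete, the following is a minimal sketch of a quantile check loss smoothed by convolution with a Gaussian kernel of bandwidth $h$. It assumes the closed form commonly used for convolution-smoothed quantile regression, $\ell_h(u)=h\,\phi(u/h)+u\,(\tau-\Phi(-u/h))$; the abstract does not state the exact smoothing used by KSFIQR, so this form is an assumption. All function names are hypothetical. The sketch omits the invariance penalty and the Gumbel trick, and it uses plain gradient descent on a scalar location rather than L-BFGS-B, so it illustrates only the smoothed loss, not the KSFIQR estimator.

```python
import math


def std_normal_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def check_loss(u, tau):
    """Standard (non-smooth) quantile check loss rho_tau(u)."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def smoothed_check_loss(u, tau, h):
    """Check loss convolved with a Gaussian kernel of bandwidth h > 0 (assumed form)."""
    z = u / h
    return h * std_normal_pdf(z) + u * (tau - std_normal_cdf(-z))


def smoothed_check_grad(u, tau, h):
    """Derivative of smoothed_check_loss with respect to u."""
    return tau - std_normal_cdf(-u / h)


def smoothed_quantile(sample, tau, h, steps=2000, lr=0.5):
    """Minimise the average smoothed loss over a scalar location q.

    Plain gradient descent is used here for simplicity; it is a stand-in,
    not the optimiser named in the abstract.
    """
    q = sum(sample) / len(sample)  # start from the sample mean
    n = len(sample)
    for _ in range(steps):
        # d/dq of mean l_h(y - q) equals -mean l_h'(y - q)
        grad = -sum(smoothed_check_grad(y - q, tau, h) for y in sample) / n
        q -= lr * grad
    return q
```

Because the smoothed loss is differentiable everywhere, unlike the check loss at $u=0$, gradient-based quasi-Newton routines such as L-BFGS-B can be applied to it directly.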

Paper Structure (15 sections, 4 theorems, 53 equations, 4 figures, 2 algorithms)

This paper contains 15 sections, 4 theorems, 53 equations, 4 figures, 2 algorithms.

Key Result

Proposition 3.1

Under Assumptions 3.1-3.3, assume that $f_l\leq f_{\varepsilon^{(e)}|\boldsymbol{x}_{S^*}^{(e)}}(0)\leq f_u$ for some $f_u\geq f_l>0$ and $f_{\varepsilon^{(e)}|\boldsymbol{x}_{S^*}^{(e)}}(u)$ is $l_0$-Lipschitz almost surely over $\boldsymbol{x}_{S^*}^{(e)}$ for each $e\in\mathcal{E}$, …

Figures (4)

  • Figure 1: Simulation results for Model 1, where the first and second rows correspond to Model 1 (i) and Model 1 (ii), respectively. (a) reports the averaged $\ell_2$ error $\|\bar{\boldsymbol{\Sigma}}^{1/2}(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})\|_2^2$ while (d) reports the averaged $\ell_2$ error $\|\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}\|_2^2$ (since $\bar{\boldsymbol{\Sigma}}$ is not stable in Model 1 (ii)) over $200$ replications for each approach as a function of sample size $n$. (b) and (e) exhibit the average number of selected variables from the true support $S^*=\{1,2,3\}$ and the endogenous set $G=\{7,8,9\}$ across $200$ replications under varying $n$ for the EILLS and KSFIQR estimators. (c) and (f) provide a visual comparison of the solutions produced by each method over $60$ repeated trials at $n=500$. The true parameter $\boldsymbol{\beta}^*$ and the population-level pooled least squares estimate $\bar{\boldsymbol{\beta}}$ are included in red as benchmarks.
  • Figure 2: Solution paths of the KSFIQR estimator (with $\tau=0.5,~h=0.1$, and a Gaussian kernel) in a single trial under Model 1 (i) as $\gamma$ varies.
  • Figure 3: The average number of selected variables from the true support $S^*=\{1,2,3\}$ and the non-causal set $G=\{4\}$ across $200$ replications under varying $n$ for KSFIQR (with quantile levels $\tau\in\{0.1, 0.3, 0.5, 0.7, 0.9\}$, bandwidth $h=\sqrt{\tau(1-\tau)p\gamma/n_*}$ and a Gaussian kernel) and EILLS estimators under Model 2.
  • Figure 4: Average number of selected variables from the true support $S^*=\{1,2,3\}$ over $200$ replications under varying $n$ for the KSFIQR estimator (with quantile levels $\tau\in\{0.5,0.6,0.7,0.8,0.9\}$, bandwidth $h=\sqrt{\tau(1-\tau)p\gamma/n_*}$ and a Gaussian kernel) and the EILLS estimator under Model 3 with $q=0.8$ or $q=0.9$.
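
The quantities named in the captions above can be written out as a short sketch: the bandwidth rule $h=\sqrt{\tau(1-\tau)p\gamma/n_*}$ from Figures 3 and 4, and the two squared-error metrics from Figure 1. The function names are hypothetical, and the symbols $p$, $\gamma$ and $n_*$ are not defined in this excerpt, so they are treated here as plain positive numbers. The weighted error uses the identity $\|\bar{\boldsymbol{\Sigma}}^{1/2}\boldsymbol{d}\|_2^2=\boldsymbol{d}^\top\bar{\boldsymbol{\Sigma}}\boldsymbol{d}$, which holds for a symmetric positive semi-definite $\bar{\boldsymbol{\Sigma}}$.

```python
import math


def bandwidth(tau, p, gamma, n_star):
    """Bandwidth h = sqrt(tau * (1 - tau) * p * gamma / n_star) from the captions.

    p, gamma and n_star are taken as given positive numbers; their precise
    definitions are not part of this excerpt.
    """
    return math.sqrt(tau * (1.0 - tau) * p * gamma / n_star)


def squared_l2_error(beta_hat, beta):
    """Squared Euclidean error ||beta_hat - beta||_2^2."""
    return sum((a - b) ** 2 for a, b in zip(beta_hat, beta))


def weighted_squared_error(beta_hat, beta, sigma):
    """Quadratic form d^T Sigma d with d = beta_hat - beta.

    Equals ||Sigma^{1/2} d||_2^2 when sigma is symmetric positive semi-definite.
    sigma is a list of rows (a square nested list).
    """
    d = [a - b for a, b in zip(beta_hat, beta)]
    k = len(d)
    return sum(d[i] * sigma[i][j] * d[j] for i in range(k) for j in range(k))
```

With an identity `sigma`, `weighted_squared_error` reduces to `squared_l2_error`, which is a quick consistency check between the two metrics.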

Theorems & Definitions (9)

  • Definition 2.1: $\tau$-CP-invariant Set
  • Definition 3.1: $\tau$-SQR-invariant Set and $\tau$-QR-invariant Set
  • Definition 3.2: $(\tau,\delta)$-nearly-QR-invariant Set
  • Proposition 3.1: $S^*$ is a $(\tau,\delta)$-nearly-QR-invariant set
  • Definition 3.3: $\tau$-Pooled Spurious Endogenously Spurious Variables and Exogenously Spurious Variables
  • Theorem 3.1
  • Remark 3.1
  • Theorem 3.2: Non-asymptotic $\ell_2$ Error Bound
  • Theorem 3.3: Non-asymptotic Variable Selection Consistency