Invariant quantile regression for heterogeneous environments

Bo Fu; Dandan Jiang

Abstract

In this paper, we propose an invariant quantile regression (IQR) framework designed for multi-environment datasets, which captures invariance across different environments. The model is closely related to transfer learning, causal inference, and fair machine learning, and is motivated by scenarios in which the conditional probability of the response given the covariates varies, while certain key features remain invariant. This perspective differs notably from previous works that restrict attention to the conditional mean, which is often insufficient in heterogeneous environments: the resulting estimators can become sensitive to ``bad'' environments or to changes in the shape of the noise distribution. In contrast, quantile-based invariance naturally accommodates heterogeneity and aligns more closely with structural causal models, in which variables that are invariant across environments at one or multiple quantile levels indicate potential, stable causal predictors. Moreover, the set of endogenous variables under the IQR framework is typically larger than that under the conditional mean framework, which in turn promotes more effective exclusion of spurious (non-causal) predictors, provided that endogenous variables are not incorporated. To achieve this, we introduce a Kernel-Smoothed Focused Invariance Quantile Regression (KSFIQR) estimator, which leverages the underlying invariance structure and the heterogeneity among environments to ensure stable estimation across multiple environments. We establish the causal discovery properties of our method, demonstrate its ability to overcome the ``curse of endogeneity'', and derive an $\ell_2$ error bound for our estimator in the low-dimensional regime, all in a non-asymptotic framework. From an algorithmic perspective, we implement the method using L-BFGS-B and the Gumbel trick, and our numerical studies validate the proposed approach.
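As a rough illustration of the kernel-smoothing and L-BFGS-B ingredients mentioned in the abstract, the sketch below fits a convolution-smoothed quantile regression with a Gaussian kernel on a single pooled sample. It is not the KSFIQR estimator: the focused invariance regularizer, the multi-environment structure, and the Gumbel trick are all omitted. The function names `smoothed_check_loss` and `fit_smoothed_qr`, the closed form used for the Gaussian-smoothed check loss, and the default bandwidth are assumptions made for illustration only.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import ndtr  # standard normal CDF


def smoothed_check_loss(u, tau, h):
    """Check loss rho_tau convolved with a Gaussian kernel of bandwidth h.

    Uses rho_tau(v) = |v|/2 + (tau - 1/2) v and the closed form of
    E|u + h Z| for Z ~ N(0, 1). Illustrative helper, not the paper's code.
    """
    z = u / h
    folded = np.sqrt(2.0 / np.pi) * np.exp(-0.5 * z**2) + z * (1.0 - 2.0 * ndtr(-z))
    return 0.5 * h * folded + (tau - 0.5) * u


def fit_smoothed_qr(X, y, tau=0.5, h=0.3):
    """Minimize the averaged smoothed check loss over beta with L-BFGS-B."""
    n, p = X.shape

    def objective(beta):
        u = y - X @ beta
        loss = smoothed_check_loss(u, tau, h).mean()
        # d/du of the smoothed loss is Phi(u/h) + tau - 1; chain rule gives -X.
        grad = -(X * (ndtr(u / h) + tau - 1.0)[:, None]).mean(axis=0)
        return loss, grad

    result = minimize(objective, np.zeros(p), jac=True, method="L-BFGS-B")
    return result.x


# Toy usage on synthetic data (assumed setup, not one of the paper's models).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=2000)
beta_hat = fit_smoothed_qr(X, y, tau=0.5, h=0.3)
```

Because the smoothed loss is differentiable everywhere, a quasi-Newton method such as L-BFGS-B can be applied directly, which is not the case for the piecewise-linear check loss.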

Paper Structure (15 sections, 4 theorems, 53 equations, 4 figures, 2 algorithms)

Introduction
Related Work
New Contributions
Roadmap and Notation
Background and setup
Problem statement
Convolution-type Smoothed techniques
Kernel-smoothed Focused Invariance Quantile Regression
Kernel-smoothed Focused Quantile Invariance Regularizer
Nearly-QR-invariance
Warmup: Local Strong Convexity of Population Loss
Non-asymptotic Error Bounds and Variable Selection consistency
Practical implementation
Illustration and Numerical Experiments
Conclusion

Key Result

Proposition 3.1

Under Assumptions 3.1-3.3, assume that $f_l\leq f_{\varepsilon^{(e)}|\boldsymbol{x}_{S^*}^{(e)}}(0)\leq f_u$ for some $f_u\geq f_l>0$ and $f_{\varepsilon^{(e)}|\boldsymbol{x}_{S^*}^{(e)}}(u)$ is $l_0$-Lipschitz almost surely over $\boldsymbol{x}_{S^*}^{(e)}$ for each $e\in\mathcal{E}$, tha

Figures (4)

Figure 1: Simulation results for Model 1, where the first and second rows correspond to Model 1 (i) and Model 1 (ii), respectively. (a) reports the averaged $\ell_2$ error $\|\bar{\boldsymbol{\Sigma}}^{1/2}(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})\|_2^2$ while (d) reports the averaged $\ell_2$ error $\|\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}\|_2^2$ (since $\bar{\boldsymbol{\Sigma}}$ is not stable in Model 1 (ii)) over $200$ replications for each approach as a function of sample size $n$. (b) and (e) exhibit the average number of selected variables from the true support $S^*=\{1,2,3\}$ and the endogenous set $G=\{7,8,9\}$ across $200$ replications under varying $n$ for the EILLS and KSFIQR estimators. (c) and (f) provide a visual comparison of the solutions produced by each method over $60$ repeated trials at $n=500$. The true parameter $\boldsymbol{\beta}^*$ and the population-level pooled least squares estimate $\bar{\boldsymbol{\beta}}$ are included in red as benchmarks.
Figure 2: Solution paths of the KSFIQR estimator (with $\tau=0.5,~h=0.1$, and a Gaussian kernel) in a single trial under Model 1 (i) as $\gamma$ varies.
Figure 3: The average number of selected variables from the true support $S^*=\{1,2,3\}$ and the non-causal set $G=\{4\}$ across $200$ replications under varying $n$ for KSFIQR (with quantile levels $\tau\in\{0.1, 0.3, 0.5, 0.7, 0.9\}$, bandwidth $h=\sqrt{\tau(1-\tau)p\gamma/n_*}$ and a Gaussian kernel) and EILLS estimators under Model 2.
Figure 4: Average number of selected variables from the true support $S^*=\{1,2,3\}$ over $200$ replications under varying $n$ for the KSFIQR estimator (with quantile levels $\tau\in\{0.5,0.6,0.7,0.8,0.9\}$, bandwidth $h=\sqrt{\tau(1-\tau)p\gamma/n_*}$ and a Gaussian kernel) and the EILLS estimator under Model 3 with $q=0.8$ or $q=0.9$.

Theorems & Definitions (9)

Definition 2.1: $\tau$-CP-invariant Set
Definition 3.1: $\tau$-SQR-invariant Set and $\tau$-QR-invariant Set
Definition 3.2: $(\tau,\delta)$-nearly-QR-invariant Set
Proposition 3.1: $S^*$: $(\delta,\tau)$-nearly-QR-invariant set
Definition 3.3: $\tau$-Pooled Spurious Endogenously Spurious Variables and Exogenously Spurious Variables
Theorem 3.1
Remark 3.1
Theorem 3.2: Non-asymptotic $\ell_2$ Error Bound
Theorem 3.3: Non-asymptotic Variable Selection Consistency
