Asymptotically-exact selective inference for quantile regression

Yumeng Wang; Snigdha Panigrahi; Xuming He

TL;DR

This work develops an asymptotically-exact selective inference framework for quantile regression after model selection by coupling smoothed quantile regression with external randomization. The authors construct a one-dimensional pivot that accounts for the selection event and yields valid confidence intervals for the effects of selected variables on conditional quantile functions, without relying on strong distributional assumptions. The method leverages all available data for both selection and inference, and it demonstrates superior coverage, shorter interval lengths, and improved variable-selection accuracy compared with data-splitting or naive approaches across simulations and a real birth-weight dataset. The results hold uniformly over a broad class of data-generating distributions, and the approach offers practical scalability and potential extensions to other penalties and nonlinear models.

Abstract

In modern data analysis, it is common to select a model before performing statistical inference. Selective inference tools make adjustments for the model selection process in order to ensure reliable inference post selection. In this paper, we introduce an asymptotic pivot to infer about the effects of selected variables on conditional quantile functions. Utilizing estimators from smoothed quantile regression, our proposed pivot is easy to compute and yields asymptotically-exact selective inference without making strict distributional assumptions about the response variable. At the core of our pivot is the use of external randomization variables, which allows us to utilize all available samples for both selection and inference, without partitioning the data into independent subsets or discarding samples at any step. From simulation studies, we find that: (i) the asymptotic confidence intervals based on our pivot achieve the desired coverage rates, even in cases where sample splitting fails due to insufficient sample size for inference; (ii) our intervals are consistently shorter than those produced by sample splitting across various models and signal settings. We report similar findings when we apply our approach to study risk factors for low birth weights in a publicly accessible dataset of US birth records from 2022.
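To make the two ingredients named in the abstract concrete, the following is a minimal sketch of a kernel-smoothed check loss and a randomized, $\ell_1$-penalized objective built from it. It assumes a Gaussian kernel, for which the smoothed loss has a closed form. The function names, the $1/n$ scaling of the loss, and the form of the linear perturbation term `omega` are illustrative assumptions, not the authors' implementation; the exact scaling and randomization scheme used in the paper may differ.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def check_loss(u, tau):
    """Usual (non-smooth) quantile check loss rho_tau(u)."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def smoothed_check_loss(u, tau, h):
    """Check loss convolved with a Gaussian kernel of bandwidth h (assumed kernel)."""
    return h * norm_pdf(u / h) + u * (tau - norm_cdf(-u / h))

def smoothed_check_grad(u, tau, h):
    """Derivative of the smoothed loss with respect to the residual u."""
    return norm_cdf(u / h) - (1.0 - tau)

def randomized_objective(beta, X, y, tau, h, lam, omega):
    """Hypothetical randomized objective: smoothed loss + l1 penalty - omega'beta.

    omega plays the role of the external randomization variable; its
    distribution and scaling are assumptions of this sketch.
    """
    n = len(y)
    residuals = [yi - sum(xij * bj for xij, bj in zip(xi, beta))
                 for xi, yi in zip(X, y)]
    loss = sum(smoothed_check_loss(r, tau, h) for r in residuals) / n
    penalty = lam * sum(abs(b) for b in beta)
    perturbation = sum(w * b for w, b in zip(omega, beta))
    return loss + penalty - perturbation
```

As the bandwidth `h` shrinks, `smoothed_check_loss` approaches `check_loss`, while for any `h > 0` it is differentiable everywhere, which is what makes gradient-based fitting and a smooth asymptotic analysis possible. Setting `omega` to zero recovers an ordinary penalized objective with no randomization.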

Paper Structure (51 sections, 34 theorems, 296 equations, 15 figures, 6 tables)

Introduction
The $\ell_1$-penalized SQR method
Selective inference and a first example
Randomized selective inference
Related work
Pivot using SQR estimators
Basics
The SQR estimators
Our pivot
Link with the least squares regression
Asymptotic theory
Simulation study
Coverage rates
Inferential power
Estimation accuracy
...and 36 more sections

Key Result

Proposition 1

Let $\kappa = \int_{-\infty}^{\infty}|u|^2 K(u) d u$. Under Assumptions aspt:moment_bound and aspt:Lip, we have that where $h'$ denotes the bandwidth used for inference and $\|\cdot\|$ represents the $\ell_2$-norm.

Figures (15)

Figure 1: Comparison between an adaptation of the polyhedral method ("Previous"), data splitting ("Splitting") and our proposed method ("Proposed"). Left: Box plots for coverage probabilities of 90% confidence intervals with the diamond symbol denoting the mean coverage rate of intervals. The "Proposed" method provides valid selective inference across all signal regimes, as does "Splitting". However, the coverage rates of the "Previous" method are significantly lower than the desired level in both "Low" and "Mid" signal regimes. Middle: Lengths of confidence intervals. The confidence intervals generated by "Proposed" are substantially shorter than those of "Previous" in both "Low" and "Mid" signal regimes, and are also shorter than those of "Splitting" across all regimes. Right: Proportions of infinitely long intervals. The "Previous" method has a high probability of generating infinite intervals in "Low" and "Mid" signal regimes, which is consistent with the finding in kivaranovic2020tight.
Figure 2: The performance of selection and inference of our proposed method with varying randomization variance levels. Randomization levels 1, 2, 3, and 4 correspond to $\delta^2 = 0.4, 0.6, 0.8, 1$, while randomization level 0 corresponds to the "Naive" method. Left: "Recall" across different randomization levels, showing the impact of varying $\delta^2$ on model selection. Middle: Coverage probabilities of 90% confidence intervals. Right: Lengths of confidence intervals. With a small randomization variance, our proposed method provides valid inference, whereas the "Naive" method underperforms.
Figure 3: Coverage rates of different methods across different models and signal settings. The gray dashed line represents the prespecified target coverage rate at $0.9$, and the diamond marks highlight the averaged coverage rates over all replications. We observed that the "Proposed" method consistently achieves the target coverage rate across all scenarios, whereas "Naive" and "Splitting" underperform.
Figure 4: The boxplots present the ratio of average interval lengths for the selected parameters between the "Proposed" method and the "Splitting" method across different models and signal strengths. "Proposed" yields significantly shorter intervals than "Splitting" in all settings.
Figure 5: $90\%$ confidence intervals for the variables chosen as significant by the full data analysis. The "Splitting" method does not select the "Weight Gain" factor and fails to identify the significance of "Cigarettes 3rd Trimester", "Five Minute APGAR Score" and "Steroids" based on subsamples. In contrast, "Proposed" identifies the association between these factors and the low birth weight in twins even when a "Baseline" on the full data. The average length of the confidence intervals produced by "Proposed" is $0.125$, while "Splitting" results in an average interval length of $0.471$.
...and 10 more figures

Theorems & Definitions (68)

Proposition 1
Remark
Remark
Proposition 2
Corollary 1
Corollary 2
Remark
Proposition 3
Remark
Proposition 4
...and 58 more