Distributed High-Dimensional Quantile Regression: Estimation Efficiency and Support Recovery

Caixing Wang, Ziliang Shen

TL;DR

This work tackles distributed high-dimensional linear quantile regression (QR) by converting the non-smooth QR problem into a smooth least-squares problem through a double-smoothing, Newton-type transformation. The resulting DHSQR framework performs distributed estimation with minimal communication (broadcasting low-dimensional gradients) and solves a Lasso-penalized least-squares problem on a central node, achieving near-oracle rates after a constant number of iterations. Theoretical guarantees establish a convergence rate of $\mathcal{O}_\mathbb{P}(\sqrt{s\log N / N})$, together with beta-min conditions that ensure exact support recovery under standard regularity assumptions. Extensive simulations and a real-data HIV drug-sensitivity application demonstrate strong estimation accuracy, robust performance under heavy-tailed and heterogeneous noise, and favorable computation and communication efficiency compared with existing methods. The approach offers scalable, robust distributed quantile regression with provable guarantees for parameter estimation and variable selection in high dimensions.

Abstract

In this paper, we focus on distributed estimation and support recovery for high-dimensional linear quantile regression. Quantile regression is a popular alternative to least squares regression because it is robust against outliers and data heterogeneity. However, the non-smoothness of the check loss function poses major challenges to both computation and theory in the distributed setting. To tackle these problems, we transform the original quantile regression into a least-squares optimization. By applying a double-smoothing approach, we extend a previous Newton-type distributed approach without requiring the restrictive assumption of independence between the error term and the covariates. We develop an efficient algorithm that enjoys high computation and communication efficiency. Theoretically, the proposed distributed estimator achieves a near-oracle convergence rate and high support recovery accuracy after a constant number of iterations. Extensive experiments on synthetic examples and a real data application further demonstrate the effectiveness of the proposed method.
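
To make the non-smoothness of the check loss concrete, the sketch below compares the check loss $\rho_\tau(u) = u(\tau - \mathbb{1}\{u<0\})$ with a Gaussian-kernel convolution-smoothed version. This is a generic smoothing device, not necessarily the paper's exact double-smoothing construction; the function names and the choice of a Gaussian kernel are assumptions made for illustration.

```python
import math


def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def std_normal_cdf(a):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def smoothed_check_loss(u, tau, h):
    """Check loss convolved with a Gaussian kernel of bandwidth h > 0.

    Assumption: Gaussian kernel, chosen only because it gives a closed form.
    Uses rho_tau(u) = |u|/2 + (tau - 1/2) * u and E|a + Z| for Z ~ N(0, 1).
    """
    a = u / h
    abs_mean = (math.sqrt(2.0 / math.pi) * math.exp(-0.5 * a * a)
                + a * (1.0 - 2.0 * std_normal_cdf(-a)))
    return 0.5 * h * abs_mean + (tau - 0.5) * u


def smoothed_check_grad(u, tau, h):
    """Derivative of the smoothed loss: Phi(u / h) - (1 - tau)."""
    return std_normal_cdf(u / h) - (1.0 - tau)


if __name__ == "__main__":
    tau, h = 0.5, 0.2
    for u in (-1.0, -0.1, 0.0, 0.1, 1.0):
        print(u, check_loss(u, tau), smoothed_check_loss(u, tau, h),
              smoothed_check_grad(u, tau, h))
```

Unlike the check loss, the smoothed loss is differentiable at zero, and it approaches the check loss as the bandwidth `h` shrinks. That differentiability is what makes gradient- and Newton-type updates available.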

Paper Structure (26 sections, 10 theorems, 120 equations, 4 figures, 14 tables)

Key Result

Theorem 3.7

Suppose that the initial estimator satisfies $\|\widehat{\boldsymbol{\beta}}_{0,h}-\boldsymbol{\beta}^{*}\|_2=\mathcal{O}_\mathbb{P}(a_N)$ and let $h \asymp\left(s \log N / N\right)^{1 / 3}$, $b \asymp\left(s \log n / n\right)^{1 / 3}$ and $a_N \asymp \sqrt{s\log N/n}$. Take where $C$ is a sufficiently large constant, and $\eta=\max\left\{(s\log n/n)^{1/3},\sqrt{s\log N/n}\right\}$. Then under A
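
As a rough numerical illustration of the orders in the theorem, the sketch below evaluates $h$, $b$, $a_N$, and $\eta$ for given sparsity $s$, total sample size $N$, and local sample size $n$. The theorem fixes these quantities only up to constants, so setting every constant to 1 is an assumption, as are the function and parameter names.

```python
import math


def theorem_orders(s, N, n, c_h=1.0, c_b=1.0, c_a=1.0):
    """Evaluate the bandwidth and rate orders with hypothetical constants.

    s: sparsity level, N: total sample size, n: local sample size.
    c_h, c_b, c_a: illustrative constants (not specified by the theorem).
    """
    h = c_h * (s * math.log(N) / N) ** (1.0 / 3.0)   # global bandwidth order
    b = c_b * (s * math.log(n) / n) ** (1.0 / 3.0)   # local bandwidth order
    a_N = c_a * math.sqrt(s * math.log(N) / n)       # initial-estimator rate
    # eta as defined in the theorem, without any constants
    eta = max((s * math.log(n) / n) ** (1.0 / 3.0),
              math.sqrt(s * math.log(N) / n))
    return {"h": h, "b": b, "a_N": a_N, "eta": eta}


if __name__ == "__main__":
    orders = theorem_orders(s=10, N=100_000, n=1_000)
    for name, value in orders.items():
        print(f"{name} = {value:.4f}")
```

Because $h$ depends on the total sample size $N$ while $b$ depends only on the local sample size $n$, the global bandwidth is the smaller of the two whenever $N$ is much larger than $n$.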

Figures (4)

  • Figure 1: The $\ell_2$-error (with an error bound) between the true and estimated parameters versus the number of iterations, at a fixed quantile level $\tau=0.5$. Left panel: homoscedastic errors with Normal, $t_3$, and Cauchy noise, from top to bottom. Right panel: heteroscedastic errors with Normal, $t_3$, and Cauchy noise, from top to bottom.
  • Figure 2: The $\ell_2$-error relative to the true parameter versus the total and local sample sizes, at a fixed quantile level $\tau=0.5$. Top panel: effect of the total sample size $N$ for the homoscedastic (left) and heteroscedastic (right) error cases. Bottom panel: effect of the local sample size $n$ for the homoscedastic (left) and heteroscedastic (right) error cases.
  • Figure 3: The $F_1$ score relative to the true parameter versus the total and local sample sizes, at a fixed quantile level $\tau=0.5$. Top panel: effect of the total sample size $N$ for the homoscedastic (left) and heteroscedastic (right) error cases. Bottom panel: effect of the local sample size $n$ for the homoscedastic (left) and heteroscedastic (right) error cases.
  • Figure 4: Histograms of the drug sensitivity variable. Left: the original distribution. Right: the distribution after a logarithmic transformation.
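
The two metrics reported in the figures can be computed as in the minimal sketch below. It assumes the standard definitions: the Euclidean norm of the estimation error, and an $F_1$ score built from the precision and recall of the estimated support. The paper's exact definitions are not given here, and the function names are hypothetical.

```python
import math


def l2_error(beta_hat, beta_true):
    """Euclidean distance between the estimated and true coefficient vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(beta_hat, beta_true)))


def support_f1(beta_hat, beta_true, tol=0.0):
    """F1 score of the estimated support against the true support.

    A coordinate is in the support when its absolute value exceeds tol
    (tol is an illustrative threshold, assumed to be 0 by default).
    """
    est = {j for j, v in enumerate(beta_hat) if abs(v) > tol}
    true = {j for j, v in enumerate(beta_true) if abs(v) > tol}
    tp = len(est & true)          # correctly selected coordinates
    if tp == 0:
        return 0.0
    precision = tp / len(est)     # share of selected coordinates that are true
    recall = tp / len(true)       # share of true coordinates that are selected
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    beta_true = [1.0, 0.0, 2.0, 0.0]
    beta_hat = [0.9, 0.1, 0.0, 0.0]
    print(l2_error(beta_hat, beta_true), support_f1(beta_hat, beta_true))
```

An $F_1$ score of 1 corresponds to exact support recovery, since the estimated support then has neither false positives nor false negatives.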

Theorems & Definitions (20)

  • Remark 2.1
  • Remark 2.2
  • Theorem 3.7
  • Theorem 3.8
  • Theorem 3.9
  • Theorem 3.10
  • Lemma 1.1
  • Lemma 1.2
  • proof
  • Lemma 1.3
  • ...and 10 more