Distributed Sparse Linear Regression under Communication Constraints

Rodney Fonseca; Boaz Nadler

TL;DR

This work tackles high-dimensional sparse linear regression in a distributed setting with tight per-machine communication constraints. It introduces a two-round protocol where each machine computes a debiased lasso estimator but communicates only a small subset of coordinates in the first round; a fusion center performs voting to recover the support, followed by a second round of centralized-like least-squares on the estimated support. The authors establish exact support-recovery guarantees under sublinear communication and provide $\ell_2$-error bounds that match centralized oracle performance under sufficient total sample size $N = nM$, supported by simulations showing competitive results against more communication-intensive methods. The results illuminate tradeoffs among communication, SNR, and the number of machines, and open pathways for extensions to non-Gaussian designs and other sparse estimation tasks.
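The first-round voting step described above can be sketched in a few lines. This is a minimal illustration, not the authors' algorithm: the local estimates are hard-coded toy vectors standing in for per-machine debiased lasso outputs, the rule "send the indices of the L largest magnitudes" and the rule "keep coordinates with at least a threshold number of votes" are assumptions, and the names `top_indices` and `vote_for_support` are hypothetical.

```python
from collections import Counter


def top_indices(local_estimate, num_to_send):
    """One machine's message: indices of its num_to_send largest-magnitude coordinates."""
    order = sorted(range(len(local_estimate)),
                   key=lambda k: abs(local_estimate[k]), reverse=True)
    return sorted(order[:num_to_send])


def vote_for_support(messages, vote_threshold):
    """Fusion center: keep coordinates named by at least vote_threshold machines."""
    votes = Counter(k for message in messages for k in set(message))
    return sorted(k for k, count in votes.items() if count >= vote_threshold)


# Toy data (hypothetical): 4 machines, dimension 6, true support {1, 4}.
# Each vector stands in for a noisy local estimate of the same sparse vector.
local_estimates = [
    [0.1, 0.9, -0.2, 0.8, 0.7, 0.3],
    [0.0, 0.8, 0.5, -0.1, 0.1, 0.2],
    [-0.6, 0.2, 0.1, 0.0, 1.1, 0.9],
    [0.7, 0.3, 0.1, -0.2, 0.6, 0.0],
]

# Round 1: every machine sends only 2 indices instead of all 6 values.
messages = [top_indices(est, num_to_send=2) for est in local_estimates]
support = vote_for_support(messages, vote_threshold=2)
print(messages)
print(support)
```

In this toy run no single machine's message equals the true support, yet the vote count does recover it, which is the effect the protocol relies on. A second round would then fit least squares restricted to the returned indices.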

Abstract

In multiple domains, statistical tasks are performed in distributed settings, with data split among several end machines that are connected to a fusion center. In various applications, the end machines have limited bandwidth and power, and thus a tight communication budget. In this work we focus on distributed learning of a sparse linear regression model under severe communication constraints. We propose several two-round distributed schemes whose communication per machine is sublinear in the data dimension. In our schemes, individual machines compute debiased lasso estimators but send only very few values to the fusion center. On the theoretical front, we analyze one of these schemes and prove that, with high probability, it achieves exact support recovery at low signal-to-noise ratios, where individual machines fail to recover the support. We show in simulations that our scheme works as well as, and in some cases better than, more communication-intensive approaches.

Paper Structure (28 sections, 24 theorems, 177 equations, 5 figures, 4 algorithms)

Introduction
Notations.
Review of previous works
Background on lasso and debiased lasso
Distributed sparse regression with restricted communication
Variants of Algorithm \ref{alg:count_votes}
Theoretical results
Communication cost
Accuracy of $\hat{\theta}$
Comparison to theoretical results of Amiraz et al. (2022)
Simulations
Advantage of sending signs
Error decay versus sample size or number of machines
Cross Validation
Summary and Discussion
...and 13 more sections

Key Result

Theorem 1

Let $\hat{\cal S}$ be the support set found by Algorithm \ref{alg:count_votes} with $V_T = \ln d$, where $\epsilon \geq 2 \delta_R \sqrt{\frac{C_{\max}}{\ln d}}$ and $\delta_R$ is defined in Eq. \ref{e:bias_level}. Assume that Assumptions \ref{assump:gaussian_design_bounded_spectrum}--\ref{assump:thetamin} hold, that the dimension $d$ is sufficiently large, and that the number of machines $M$ satisfies […]. If the SNR satisfies […] with $c(d,M)$ […]
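
As a worked instance of the vote threshold, take the values used in Figure 1, $d = 50000$ and $M = 1133$. Reading $V_T$ as the minimum number of votes a coordinate needs at the fusion center is an assumption here, since the algorithm itself is not reproduced in this summary.

$$V_T = \ln d = \ln 50000 \approx 10.82, \qquad d^{0.65} = e^{0.65 \ln d} \approx e^{7.03} \approx 1133.$$

Under that reading, a coordinate would need 11 or more of the $M = 1133$ possible votes, roughly 1% of the machines.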

Figures (5)

Figure 1: SNR lower bounds of Eqs. \ref{e:SNR_support_recovery_large_tau} (solid red) and \ref{e:SNR_support_recovery_thresholds_not_fixed} (dotted) for $d=50000$, $M= 1133 \approx d^{0.65}$, $\epsilon=0$ and $V_T = \ln d$.
Figure 2: Simulation results for $n=250$, $d=5000$, $K=20$ and $M=100$. The shaded regions represent $90\%$ confidence bands.
Figure 3: Simulation results for topL-topK and its variant topL-topK-signs, where the first round uses sums of signs as described in Algorithm \ref{alg:count_signed_votes}.
Figure 4: Estimation error (on a log-log scale) as a function of sample size $n$ or number of machines $M$. In both cases, $d=5000$, $K=20$ and $\theta_{\min} = 0.4081$.
Figure 5: Accuracy of support recovery (left) and error $\|\hat{\theta}-\theta^*\|_2$ on a log scale (right), as a function of SNR, averaged over 50 realizations. Red dots are results with a fixed, identical $\lambda$ in all machines. Cyan dots are results with the lasso parameter $\lambda$ estimated separately at each machine by 10-fold cross validation.

Theorems & Definitions (49)

Remark 4.1
Remark 4.2
Remark 4.3
Theorem 1
Theorem 2
Remark 5.1
Corollary 1
Lemma 1
Lemma 2
Lemma 3
...and 39 more
