Distributed Sparse Linear Regression under Communication Constraints
Rodney Fonseca, Boaz Nadler
TL;DR
This work tackles high-dimensional sparse linear regression in a distributed setting with tight per-machine communication constraints. It introduces a two-round protocol. In the first round, each machine computes a debiased lasso estimator but communicates only a small subset of its coordinates, and a fusion center recovers the support by voting. In the second round, least squares is fit on the estimated support. The authors establish exact support-recovery guarantees under communication that is sublinear in the dimension, and provide $\ell_2$-error bounds that match centralized oracle performance when the total sample size $N = nM$ is sufficiently large. Simulations show results competitive with more communication-intensive methods. The results illuminate tradeoffs among communication, signal-to-noise ratio (SNR), and the number of machines, and open pathways for extensions to non-Gaussian designs and other sparse estimation tasks.
Abstract
In multiple domains, statistical tasks are performed in distributed settings, with data split among several end machines that are connected to a fusion center. In various applications, the end machines have limited bandwidth and power, and thus a tight communication budget. In this work we focus on distributed learning of a sparse linear regression model under severe communication constraints. We propose several two-round distributed schemes whose communication per machine is sublinear in the data dimension. In our schemes, individual machines compute debiased lasso estimators but send only very few values to the fusion center. On the theoretical front, we analyze one of these schemes and prove that, with high probability, it achieves exact support recovery at low signal-to-noise ratios, where individual machines fail to recover the support. We show in simulations that our scheme works as well as, and in some cases better than, more communication-intensive approaches.
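As a rough illustration of the two-round idea described above, the following Python sketch simulates one possible variant. It is not the authors' implementation, and several choices are assumptions made only for this example: the design is standard Gaussian with identity covariance (so the debiasing matrix is taken to be the identity), each machine sends the indices of its top `k_send` coordinates, the fusion center keeps the `K` most-voted indices, and in round two each machine sends a least-squares fit restricted to the estimated support, which the center averages. All function names and parameter values are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=300):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b

def debiased_lasso(X, y, lam):
    # Assumption: identity design covariance, so the decorrelating matrix is I.
    n = X.shape[0]
    b = lasso_ista(X, y, lam)
    return b + X.T @ (y - X @ b) / n

def machine_message(X, y, lam, k_send):
    """Round 1: send only the indices of the k_send largest debiased coordinates."""
    b = debiased_lasso(X, y, lam)
    return np.argsort(-np.abs(b))[:k_send]

def vote_support(messages, d, K):
    """Fusion center: count votes per index and keep the K most-voted indices."""
    votes = np.zeros(d, dtype=int)
    for idx in messages:
        votes[idx] += 1
    return np.sort(np.argsort(-votes, kind="stable")[:K])

def round_two(datasets, support):
    """Round 2: each machine sends a least-squares fit on the support; average them."""
    fits = [np.linalg.lstsq(X[:, support], y, rcond=None)[0] for X, y in datasets]
    return np.mean(fits, axis=0)

# Toy simulation (all values are illustrative, not taken from the paper).
rng = np.random.default_rng(0)
d, K, M, n, sigma = 50, 3, 10, 200, 1.0
beta = np.zeros(d)
beta[:K] = 1.0
datasets = []
for _ in range(M):
    X = rng.standard_normal((n, d))
    y = X @ beta + sigma * rng.standard_normal(n)
    datasets.append((X, y))

lam = sigma * np.sqrt(2 * np.log(d) / n)
messages = [machine_message(X, y, lam, K) for X, y in datasets]
support = vote_support(messages, d, K)
beta_hat = np.zeros(d)
beta_hat[support] = round_two(datasets, support)
print("estimated support:", support)
print("l2 error:", np.linalg.norm(beta_hat - beta))
```

In this toy setting each machine transmits only `K` indices in the first round and `K` real values in the second, rather than a full `d`-dimensional vector. The signal here is deliberately strong so the example is easy to check; the low-SNR regime analyzed in the paper, where individual machines fail but voting succeeds, would require smaller coefficients and more machines.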
