High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates

Fred Lu; Ryan R. Curtin; Edward Raff; Francis Ferraro; James Holt

TL;DR

This work addresses scalable distributed training for high-dimensional sparse logistic regression by developing proxCSL, a proximal Newton-based solver for the communication-efficient surrogate likelihood (CSL) framework. By combining adaptive proximal regularization (adaptive $\alpha$) with a Hessian-free proximal Newton approach and Hessian caching, proxCSL stabilizes updates, preserves sparsity, and converges quickly in realistic high-dimensional regimes. The theoretical guarantees assume restricted strong convexity and a restricted Lipschitz Hessian, and establish convergence and error bounds. Empirically, proxCSL is more accurate than the other distributed methods at similar or faster runtimes in both single-node and multi-node settings, often matching the full-data solution after just two updates. The method offers a scalable path for learning sparse models on datasets with millions of features.

Abstract

As the size of datasets used in statistical learning continues to grow, distributed training of models has attracted increasing attention. These methods partition the data and exploit parallelism to reduce memory and runtime, but suffer increasingly from communication costs as the data size or the number of iterations grows. Recent work on linear models has shown that a surrogate likelihood can be optimized locally to iteratively improve on an initial solution in a communication-efficient manner. However, existing versions of these methods have several shortcomings as the data size becomes massive, including diverging updates and inefficient handling of sparsity. In this work we develop solutions to these problems which enable us to learn a communication-efficient distributed logistic regression model even beyond millions of features. In our experiments we demonstrate a large improvement in accuracy over distributed algorithms with only a few distributed update steps needed, and similar or faster runtimes. Our code is available at https://github.com/FutureComputing4AI/ProxCSL.
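To make the idea of a locally optimized surrogate likelihood concrete, here is a minimal sketch. It assumes the standard CSL construction from the literature (local loss plus a linear correction built from the global gradient), an L1 penalty `lam`, and a proximal term weighted by `alpha`. The function names and the exact form of the objective are illustrative assumptions and may differ from what proxCSL implements.

```python
import numpy as np

def logistic_loss_and_grad(w, X, y):
    """Mean logistic loss and its gradient, with labels y in {-1, +1}."""
    margins = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -margins))
    # sigmoid(-m) = exp(-log(1 + exp(m))), computed in a numerically stable way
    sig_neg = np.exp(-np.logaddexp(0.0, margins))
    grad = -(X.T @ (y * sig_neg)) / X.shape[0]
    return loss, grad

def surrogate_objective(w, w_prev, X_local, y_local, global_grad, lam, alpha):
    """Assumed surrogate: local loss + gradient correction + L1 + proximal term."""
    local_loss, _ = logistic_loss_and_grad(w, X_local, y_local)
    _, g_local_prev = logistic_loss_and_grad(w_prev, X_local, y_local)
    shift = global_grad - g_local_prev  # needs one round of gradient communication
    prox = 0.5 * alpha * np.sum((w - w_prev) ** 2)
    return local_loss + shift @ w + lam * np.abs(w).sum() + prox

def surrogate_smooth_grad(w, w_prev, X_local, y_local, global_grad, alpha):
    """Gradient of the smooth part of the surrogate (everything except the L1 term)."""
    _, g_local = logistic_loss_and_grad(w, X_local, y_local)
    _, g_local_prev = logistic_loss_and_grad(w_prev, X_local, y_local)
    return g_local + (global_grad - g_local_prev) + alpha * (w - w_prev)
```

By construction, the smooth part of this surrogate has the same gradient as the full-data loss at `w_prev`, which is why a single node can improve on the current solution using only its own data plus one aggregated gradient. Under this sketch, a positive `alpha` keeps the update close to `w_prev`.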

Paper Structure (17 sections, 5 theorems, 20 equations, 6 figures, 5 tables, 2 algorithms)

Introduction
Background and Related Work
Sparse logistic regression
Distributed estimation
One-shot estimation
Communication-efficient updates
Challenges for scaling CSL-like methods
A proximal solver for sparse CSL
Theoretical Results
Results
Test accuracy across sparsity levels
Runtime comparison
Convergence to a known model
Conclusion
Additional timing information
...and 2 more sections

Key Result

Lemma 1

Given quadratic loss $\mathcal{L}$ and current iterate $w^{j-1}$ which has been updated to the $(j-1)$-th coordinate, suppose the $j$-th partial first and second derivatives are $G_j$ and $H_{jj}$ respectively. Then the problem has solution
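As an illustration of this kind of coordinate-wise update, the sketch below solves the textbook one-dimensional L1-penalized quadratic subproblem $\min_z \lambda |w_j + z| + G_j z + \tfrac{1}{2} H_{jj} z^2$ by soft-thresholding. The subproblem form, the penalty `lam`, and the function names are assumptions made for this example; they are not taken from the lemma's own statement.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink x toward zero by t, returning 0 when |x| <= t."""
    return float(np.sign(x) * max(abs(x) - t, 0.0))

def coordinate_step(w_j, G_j, H_jj, lam):
    """Step z minimizing lam*|w_j + z| + G_j*z + 0.5*H_jj*z**2 (requires H_jj > 0)."""
    return soft_threshold(w_j - G_j / H_jj, lam / H_jj) - w_j

def subproblem_value(z, w_j, G_j, H_jj, lam):
    """Objective of the assumed one-dimensional subproblem, for checking."""
    return lam * abs(w_j + z) + G_j * z + 0.5 * H_jj * z ** 2
```

When the step equals `-w_j`, the updated coordinate `w_j + z` is exactly zero, which is how this style of update produces sparse iterates.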

Figures (6)

Figure 1: Iterated CSL updates using a standard solver (sCSL) and our method (proxCSL) quickly converge to the optimal objective value (as defined by fitting on the full data) when the solution is sparse. However, sCSL often fails to reach the sparsity level of the full-data fit, whereas the specialized solver used in proxCSL attains the optimal sparsity.
Figure 2: Divergence between CSL and true objective values after one proxCSL update step, as a function of number of partitions (left) and intermediate solution sparsity (right). The divergence increases with decreasing sample size and increasing dimensionality, as expected. Setting the proximal parameter $\alpha > 0$ fixes the issue.
Figure 3: Number of nonzeros vs. test set accuracy in the single-node multi-core setting over a grid of regularization values. The distributed methods (sDANE, sCSL, proxCSL) are initialized with the OWA solution and updated twice. proxCSL (blue) cleanly outperforms other distributed methods across the datasets, often matching the full data solution computed with LIBLINEAR (dashed grey). sCSL performs nearly as well as proxCSL on amazon7 but not on other datasets. sDANE and sCSL fail to achieve sparse solutions on ember-100k even after the grid resolution was increased.
Figure 4: Number of nonzeros vs. test set accuracy in the distributed multi-node setting, after two update steps for the distributed methods (sDANE, sCSL, proxCSL). On both datasets, proxCSL (blue) outperforms the other methods across all sparsity levels. Due to the massive data size, no full data solution is computed. On criteo, OWA diverges at low regularizations, so we initialize the distributed methods with Naive Avg. instead. sDANE and sCSL fail to achieve sparse solutions on ember-1M even after the grid resolution was increased.
Figure 5: Convergence of CSL methods to the true solution on a synthetic dataset with known generating model. proxCSL outperforms the baselines in terms of model $L_2$ distance (left) as well as identifying whether a given weight should be nonzero (right).
...and 1 more figure

Theorems & Definitions (7)

Lemma 1
Theorem 1: Thm. 4 of [izbicki2020distributed]
Theorem 2
Definition 1: Restricted strong convexity [negahban2012unified, jordan2018communication]
Definition 2: Restricted Lipschitz Hessian
Theorem 3
Theorem 4
