Optimizing the Optimal Weighted Average: Efficient Distributed Sparse Classification

Fred Lu, Ryan R. Curtin, Edward Raff, Francis Ferraro, James Holt

TL;DR

This paper addresses distributed training of penalized logistic regression on large-scale, high-dimensional data by introducing ACOWA, a two-round distributed algorithm built atop the optimal weighted average (OWA). It combines centroid augmentation to reduce partition variance in the first round and adaptive feature weighting (iterated Lasso) in the second round, followed by a robust merge step. The authors provide theoretical isoefficiency analyses showing ACOWA maintains scalable communication requirements comparable to OWA, and they demonstrate through extensive experiments that ACOWA yields substantially better accuracy, especially for sparse solutions, with only modest additional runtime. The approach offers a practical, scalable solution for high-dimensional distributed linear models, with broad applicability beyond logistic regression.

Abstract

While distributed training is often viewed as a solution to optimizing linear models on increasingly large datasets, inter-machine communication costs of popular distributed approaches can dominate as data dimensionality increases. Recent work on non-interactive algorithms shows that approximate solutions for linear models can be obtained efficiently with only a single round of communication among machines. However, this approximation often degenerates as the number of machines increases. In this paper, building on the recent optimal weighted average method, we introduce a new technique, ACOWA, that allows an extra round of communication to achieve noticeably better approximation quality with minor runtime increases. Results show that for sparse distributed logistic regression, ACOWA obtains solutions that are more faithful to the empirical risk minimizer and attain substantially higher accuracy than other distributed algorithms.
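
To make the weighted-average idea concrete, the following is a minimal sketch, under stated assumptions, of the one-round baseline that ACOWA builds on: each partition fits its own sparse model, and the local estimates are merged either by plain averaging or by weights fitted on a small subsample. It does not implement ACOWA's centroid augmentation or adaptive feature weighting. The function names (`fit_sparse_logreg`, `naive_average`, `weighted_average`), the proximal gradient solver, the synthetic data, and all hyperparameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def fit_sparse_logreg(X, y, l1=0.0, steps=300):
    """L1-penalized logistic regression via proximal gradient descent (ISTA).

    Labels y are in {0, 1}. This is a stand-in solver chosen for brevity,
    not the solver used by the authors.
    """
    n, d = X.shape
    # Step size 1/L, where L bounds the curvature of the mean logistic loss.
    lr = 1.0 / (0.25 * np.linalg.norm(X, 2) ** 2 / n)
    w = np.zeros(d)
    for _ in range(steps):
        z = np.clip(X @ w, -30.0, 30.0)  # avoid overflow in exp
        grad = X.T @ (1.0 / (1.0 + np.exp(-z)) - y) / n
        w = w - lr * grad
        # Soft-thresholding is the proximal step for the L1 penalty.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
    return w


def naive_average(local_ws):
    """Merge local estimates with equal weights."""
    return np.mean(local_ws, axis=0)


def weighted_average(local_ws, X_sub, y_sub):
    """Merge local estimates with weights fitted on a small subsample.

    The subsample is projected onto the span of the local estimates, a
    p-dimensional problem is solved there, and the result is mapped back.
    """
    W = np.stack(local_ws, axis=1)    # shape (d, p)
    Z = X_sub @ W                     # shape (n_sub, p)
    v = fit_sparse_logreg(Z, y_sub)   # unpenalized second-stage fit
    return W @ v


def accuracy(w, X, y):
    return float(np.mean((X @ w > 0) == (y > 0.5)))


# Synthetic data (assumption): 5 informative features out of 50.
rng = np.random.default_rng(0)
n, d, p = 4000, 50, 8
w_true = np.zeros(d)
w_true[:5] = 2.0


def sample(m):
    X = rng.standard_normal((m, d))
    y = (rng.random(m) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
    return X, y


X, y = sample(n)
X_test, y_test = sample(2000)

# Round of local computation: one sparse model per partition.
parts = np.array_split(rng.permutation(n), p)
local_ws = [fit_sparse_logreg(X[idx], y[idx], l1=0.01) for idx in parts]

# Merge step: equal weights versus weights fitted on 400 sampled points.
w_naive = naive_average(local_ws)
sub = rng.choice(n, size=400, replace=False)
w_owa = weighted_average(local_ws, X[sub], y[sub])

print("naive average accuracy:   ", accuracy(w_naive, X_test, y_test))
print("weighted average accuracy:", accuracy(w_owa, X_test, y_test))
```

Only the local weight vectors and a small subsample cross machine boundaries in this sketch, which is why the communication cost stays low as the dimensionality grows. The second-stage problem has only `p` unknowns, one per partition.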

Paper Structure

This paper contains 23 sections, 8 theorems, 26 equations, 9 figures, 2 tables, 1 algorithm.

Key Result

Lemma 3.1

Suppose a partition $\mathcal{X}_i$ of size $k$ is an $\epsilon$-coreset with probability at least $1 - \delta$. Define $\mathcal{X}^{(+2)}_i \colonequals \mathcal{X}_i \cup \{ x_{r1}, x_{r2} \}$ where $x_{r1}$ and $x_{r2}$ are uniformly randomly sampled points from the dataset. $\mathcal{X}^{(+2)}_

Figures (9)

  • Figure 1: Vertical dashed lines show synchronization points between threads, and boxes indicate different compute nodes. Approaches for many-core and distributed training of models with an $L_1$ penalty are either one-shot (left) or iterative (right), neither of which produces a satisfying, high-accuracy solution in a limited time frame. ACOWA strikes a careful balance in how much information is shared, so that a more accurate solution can be obtained with just two rounds of communication.
  • Figure 2: Accuracy on held-out test set for different numbers of partitions $p$, when sparsity is fixed. The quality of the naive averaging and OWA models degrades significantly as $p$ increases. Our method ACOWA improves accuracy across all levels of $p$.
  • Figure 3: Number of nonzeros vs. test set accuracy in the single-node parallel setting. ACOWA has consistently better performance than other distributed methods, especially for sparser solutions on newsgroups. It generally also performs the best on amazon7 across a range of sparsities, compared to the second best method (CSL).
  • Figure 4: Number of nonzeros vs. test set accuracy in the multi-node distributed setting. ACOWA again outperforms the other methods, especially for sparser solutions. OWA on the criteo dataset exhibited significant variance. We were unable to run ProxCoCoA+ in this setting due to memory usage issues and extremely long runtimes.
  • Figure 5: Number of nonzeros vs. test set accuracy in the single-node parallel setting. Note the significantly better performance of ACOWA, especially for sparser solutions. Although LIBLINEAR generally produces the best results (since it is not a distributed algorithm at all, and uses the full dataset), for the same reason it cannot scale to very large datasets that cannot fit in RAM.
  • ...and 4 more figures

Theorems & Definitions (16)

  • Lemma 3.1
  • proof
  • Lemma 3.2
  • Theorem 3.3
  • Corollary 3.4
  • Theorem 6.1
  • Theorem 6.2
  • Theorem 6.3
  • proof
  • proof
  • ...and 6 more