A GPU-Accelerated Bi-linear ADMM Algorithm for Distributed Sparse Machine Learning

Alireza Olama, Andreas Lundell, Jan Kronqvist, Elham Ahmadi, Eduardo Camponogara

Abstract

This paper introduces the Bi-linear consensus Alternating Direction Method of Multipliers (Bi-cADMM), aimed at solving large-scale regularized Sparse Machine Learning (SML) problems defined over a network of computational nodes. Mathematically, these are stated as minimization problems with convex local loss functions over a global decision vector, subject to an explicit $\ell_0$ norm constraint to enforce the desired sparsity. The considered SML problem generalizes different sparse regression and classification models, such as sparse linear and logistic regression, sparse softmax regression, and sparse support vector machines. Bi-cADMM leverages a bi-linear consensus reformulation of the original non-convex SML problem and a hierarchical decomposition strategy that divides the problem into smaller sub-problems amenable to parallel computing. In Bi-cADMM, this decomposition strategy is based on a two-phase approach. Initially, it performs a sample decomposition of the data and distributes local datasets across computational nodes. Subsequently, a delayed feature decomposition of the data is conducted on Graphics Processing Units (GPUs) available to each node. This methodology allows Bi-cADMM to undertake computationally intensive data-centric computations on GPUs, while CPUs handle more cost-effective computations. The proposed algorithm is implemented within an open-source Python package called Parallel Sparse Fitting Toolbox (PsFiT), which is publicly available. Finally, computational experiments demonstrate the efficiency and scalability of our algorithm through numerical benchmarks across various SML problems featuring distributed datasets.
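The two-phase decomposition described in the abstract can be illustrated with a small NumPy sketch. This is not the PsFiT implementation: the node count `N`, the per-node block count `M`, the toy sizes, and the helper `local_product` are hypothetical names chosen for illustration, and the feature blocks are evaluated on the CPU here, whereas the paper assigns them to GPUs. The sketch only shows that splitting the data by rows across nodes, and then by columns within each node, still reproduces the full matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 12, 6   # toy numbers of samples and features (assumed)
N, M = 3, 2    # hypothetical: 3 nodes, 2 feature blocks per node

A = rng.standard_normal((m, n))   # full data matrix
x = rng.standard_normal(n)        # global decision vector

# Phase 1: sample decomposition. Each node receives a block of rows.
row_blocks = np.array_split(A, N, axis=0)

# Phase 2: feature decomposition. Each node splits its columns into blocks.
col_index = np.array_split(np.arange(n), M)

def local_product(A_i, x, col_index):
    """Sum the partial products over feature blocks for one node.

    Each term A_i[:, idx] @ x[idx] is independent of the others, so the
    terms could be evaluated in parallel (for example, one per device).
    """
    return sum(A_i[:, idx] @ x[idx] for idx in col_index)

# Stack the per-node results back into the full product.
Ax_distributed = np.concatenate(
    [local_product(A_i, x, col_index) for A_i in row_blocks]
)

# The decomposed computation matches the centralized one.
assert np.allclose(Ax_distributed, A @ x)
```

Because every block product is independent, the data-heavy work parallelizes naturally, while only short vectors need to be combined afterwards.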

Paper Structure (16 sections, 1 theorem, 24 equations, 4 figures, 1 table, 2 algorithms)

Key Result

Theorem 2.1

For any vector $\mathrm{x} \in \mathbb{R}^n$, the condition $\left\lVert\mathrm{x}\right\rVert_0 \leq \kappa$ holds if, and only if, a vector $\mathrm{s} \in \mathbb{R}^n$ and a scalar $t \in \mathbb{R}$ exist such that,
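To illustrate how auxiliary variables can certify a cardinality bound, the sketch below checks one well-known set of bi-linear conditions. These conditions are assumed here and are not taken from the statement above: $\mathrm{x}^\top \mathrm{s} = t$, $\lVert\mathrm{x}\rVert_1 \leq t$, $\lVert\mathrm{s}\rVert_\infty \leq 1$, and $\lVert\mathrm{s}\rVert_1 \leq \kappa$. Under them, choosing $\mathrm{s} = \operatorname{sign}(\mathrm{x})$ and $t = \lVert\mathrm{x}\rVert_1$ yields a valid certificate exactly when $\mathrm{x}$ has at most $\kappa$ nonzero entries. The function names are hypothetical.

```python
import numpy as np

def cardinality_certificate(x):
    """Candidate certificate (assumed form): s = sign(x), t = ||x||_1."""
    s = np.sign(x)
    t = np.abs(x).sum()
    return s, t

def satisfies_conditions(x, s, t, kappa, tol=1e-9):
    """Check the assumed conditions:
    x^T s = t, ||x||_1 <= t, ||s||_inf <= 1, ||s||_1 <= kappa.
    """
    return bool(
        abs(x @ s - t) <= tol
        and np.abs(x).sum() <= t + tol
        and np.abs(s).max() <= 1 + tol
        and np.abs(s).sum() <= kappa + tol
    )

# Example vector with exactly two nonzero entries.
x = np.array([0.0, -2.5, 0.0, 1.0, 0.0])
s, t = cardinality_certificate(x)

# The certificate passes for kappa = 2 and fails for kappa = 1,
# consistent with ||x||_0 = 2.
ok_k2 = satisfies_conditions(x, s, t, kappa=2)
ok_k1 = satisfies_conditions(x, s, t, kappa=1)
```

The only non-convex condition is the bi-linear equality $\mathrm{x}^\top \mathrm{s} = t$; the remaining three are convex norm bounds, which is what makes a reformulation of this kind attractive for ADMM-style splitting.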

Figures (4)

  • Figure 1: Primal, dual, and bi-linear residuals for different bi-linear penalty parameters, $\rho_b = 2, 4, 8, 16$.
  • Figure 2: Comparison of computational times for the feature-scaling scenario across varying numbers of computational nodes ($N = 2, 4, 8$) using both GPU and CPU backends.
  • Figure 3: Comparison of computational times for the sample-scaling scenario across varying numbers of computational nodes ($N = 2, 4, 8$) using both GPU and CPU backends.
  • Figure 4: Comparison of total data transfer times for the feature-scaling and sample-scaling scenarios across varying numbers of computational nodes ($N = 2, 4, 8$).

Theorems & Definitions (1)

  • Theorem 2.1 (citation key: hempel2014novel)