A GPU-Accelerated Bi-linear ADMM Algorithm for Distributed Sparse Machine Learning

Alireza Olama; Andreas Lundell; Jan Kronqvist; Elham Ahmadi; Eduardo Camponogara

Abstract

This paper introduces the Bi-linear consensus Alternating Direction Method of Multipliers (Bi-cADMM), an algorithm for solving large-scale regularized Sparse Machine Learning (SML) problems defined over a network of computational nodes. Mathematically, these are minimization problems with convex local loss functions over a global decision vector, subject to an explicit $\ell_0$ norm constraint that enforces the desired sparsity. The considered SML problem generalizes several sparse regression and classification models, such as sparse linear and logistic regression, sparse softmax regression, and sparse support vector machines. Bi-cADMM combines a bi-linear consensus reformulation of the original non-convex SML problem with a hierarchical decomposition strategy that divides the problem into smaller sub-problems amenable to parallel computing. This decomposition proceeds in two phases. First, a sample decomposition of the data distributes local datasets across the computational nodes. Second, a delayed feature decomposition of the data is carried out on the Graphics Processing Units (GPUs) available to each node. This allows Bi-cADMM to run the computationally intensive, data-centric computations on GPUs while CPUs handle the less expensive ones. The proposed algorithm is implemented in a publicly available open-source Python package called Parallel Sparse Fitting Toolbox (PsFiT). Finally, numerical benchmarks on various SML problems with distributed datasets demonstrate the efficiency and scalability of the algorithm.
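The two-phase decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is only an illustration under stated assumptions: the function names `sample_decomposition` and `feature_decomposition`, the toy matrix sizes, and the use of a plain matrix-vector product are hypothetical choices made here and are not taken from PsFiT. The sketch splits a data matrix by rows (samples) across nodes, then splits each node's local data by columns (features), and checks that the per-block partial products reassemble the full product. In a GPU setting, each feature block would be a natural unit of work for a device.

```python
import numpy as np

def sample_decomposition(A, b, num_nodes):
    """Split the rows (samples) of A and the entries of b into one local dataset per node."""
    row_blocks = np.array_split(np.arange(A.shape[0]), num_nodes)
    return [(A[rows], b[rows]) for rows in row_blocks]

def feature_decomposition(A_local, num_blocks):
    """Split the columns (features) of a node's local data into contiguous blocks."""
    col_blocks = np.array_split(np.arange(A_local.shape[1]), num_blocks)
    return [A_local[:, cols] for cols in col_blocks]

# Toy data (hypothetical sizes): 12 samples, 6 features.
rng = np.random.default_rng(0)
A = rng.standard_normal((12, 6))
b = rng.standard_normal(12)
x = rng.standard_normal(6)

# Phase 1: distribute samples across 3 nodes.
local_data = sample_decomposition(A, b, num_nodes=3)

# Phase 2: on each node, split features into 2 blocks and accumulate
# the partial products A_ij @ x_j, which sum to A_i @ x.
partials = []
for A_i, _b_i in local_data:
    blocks = feature_decomposition(A_i, num_blocks=2)
    x_blocks = np.array_split(x, 2)
    partials.append(sum(A_ij @ x_j for A_ij, x_j in zip(blocks, x_blocks)))

# Stacking the per-node results recovers the full product A @ x.
reassembled = np.concatenate(partials)
```

The check `np.allclose(reassembled, A @ x)` confirms that no information is lost by the nested split; the actual sub-problems solved by Bi-cADMM on each block are more involved than this product.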

Paper Structure (16 sections, 1 theorem, 24 equations, 4 figures, 1 table, 2 algorithms)

Introduction
Problem Description
Bi-linear Consensus Reformulation
Distributed Bi-Linear ADMM Algorithm
Termination
GPU Accelerated Data-Parallel Sub-Solver
Distributed Implementation and Bi-cADMM Pseudo-Code
Numerical Experiments
Hardware and Software
Empirical Convergence
Computational Time Comparison
Scalability across features
Scalability across data points
CPU-GPU Memory Transfer
Acknowledgment
...and 1 more section

Key Result

Theorem 2.1

For any vector $\mathrm{x} \in \mathbb{R}^n$, the condition $\left\lVert\mathrm{x}\right\rVert_0 \leq \kappa$ holds if, and only if, a vector $\mathrm{s} \in \mathbb{R}^n$ and a scalar $t \in \mathbb{R}$ exist such that,

Figures (4)

Figure 1: Primal, dual, and bi-linear residuals for different bi-linear penalty parameters, $\rho_b = 2, 4, 8, 16$.
Figure 2: Comparison of computational times for feature scaling scenario across varying numbers of computational nodes ($N = 2, 4, 8$) using both GPU and CPU backends.
Figure 3: Comparison of computational times for sample scaling scenario across varying numbers of computational nodes ($N = 2, 4, 8$) using both GPU and CPU backends.
Figure 4: Comparison of total data transfer times for feature and sample scaling scenario across varying numbers of computational nodes ($N = 2, 4, 8$).

Theorems & Definitions (1)

Theorem 2.1: hempel2014novel
