Scalable Dual Coordinate Descent for Kernel Methods

Zishan Shao, Aditya Devarakonda

TL;DR

This work tackles the communication bottleneck in distributed training of kernelized models by introducing $s$-step variants of Dual Coordinate Descent (DCD) and Block Dual Coordinate Descent (BDCD) for Kernel SVM (K-SVM) and Kernel Ridge Regression (K-RR). The authors derive $s$-step algorithms that defer communication by a tunable factor of $s$, show that they compute the same solution as the baseline methods in exact arithmetic, demonstrate empirically that they remain numerically stable in finite precision, and bound their computation and communication costs under the Hockney model. They implement high-performance C/MPI versions and validate strong scaling on a Cray EX system, reporting speedups of up to $9.8\times$ on large-scale problems. The results show that latency-dominated regimes benefit most from $s$-step methods, while allreduce bandwidth can limit gains on highly scaled runs; future work includes kernel-approximation techniques (e.g., Nyström) to extend scalability further.
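As a concrete illustration of the communication pattern, here is a minimal sketch (the function names, buffer shapes, and the $s \times s$ Gram block are our assumptions, not the paper's code): the baseline sends one small message per iteration, while the $s$-step variant sends one larger message every $s$ iterations, trading extra bandwidth and computation for an $s$-fold reduction in latency.

/* Hedged sketch of the communication pattern only, not the paper's
 * implementation. Baseline: one small allreduce per iteration. */
#include <mpi.h>

void iterate_baseline(int iters, double *partial, double *global, MPI_Comm comm)
{
    for (int h = 0; h < iters; h++) {
        /* ... each rank forms its local contribution in *partial ... */
        MPI_Allreduce(partial, global, 1, MPI_DOUBLE, MPI_SUM, comm);
        /* ... update the dual variable using *global ... */
    }
}

/* s-step variant: one allreduce every s iterations. The s*s buffer models
 * a Gram block, standing in for the extra bandwidth and computation the
 * paper trades for fewer messages (the shape is our assumption). */
void iterate_sstep(int iters, int s, double *partials, double *globals, MPI_Comm comm)
{
    for (int h = 0; h < iters; h += s) {
        /* ... compute the local s x s Gram block into partials ... */
        MPI_Allreduce(partials, globals, s * s, MPI_DOUBLE, MPI_SUM, comm);
        /* ... recover the s coordinate updates locally from globals ... */
    }
}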

Abstract

Dual Coordinate Descent (DCD) and Block Dual Coordinate Descent (BDCD) are important iterative methods for solving convex optimization problems. In this work, we develop scalable DCD and BDCD methods for the kernel support vector machines (K-SVM) and kernel ridge regression (K-RR) problems. On distributed-memory parallel machines the scalability of these methods is limited by the need to communicate every iteration. On modern hardware where communication is orders of magnitude more expensive, the running time of the DCD and BDCD methods is dominated by communication cost. We address this communication bottleneck by deriving $s$-step variants of DCD and BDCD for solving the K-SVM and K-RR problems, respectively. The $s$-step variants reduce the frequency of communication by a tunable factor of $s$ at the expense of additional bandwidth and computation. The $s$-step variants compute the same solution as the existing methods in exact arithmetic. We perform numerical experiments to illustrate that the $s$-step variants are also numerically stable in finite-arithmetic, even for large values of $s$. We perform theoretical analysis to bound the computation and communication costs of the newly designed variants, up to leading order. Finally, we develop high performance implementations written in C and MPI and present scaling experiments performed on a Cray EX cluster. The new $s$-step variants achieved strong scaling speedups of up to $9.8\times$ over existing methods using up to $512$ cores.
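To make the per-iteration structure concrete, below is a minimal serial sketch of one DCD sweep for a common K-RR dual formulation, $\min_{\alpha} \tfrac{1}{2}\alpha^{\top}(K + \lambda I)\alpha - y^{\top}\alpha$ (the formulation and function name are illustrative assumptions, not necessarily the paper's). In the distributed setting, maintaining the analogue of the cached product $K\alpha$ is what forces a reduction across processors every iteration.

/* Minimal serial sketch of one DCD sweep for kernel ridge regression,
 * assuming the dual objective 0.5*a'(K + lambda*I)a - y'a.
 * K is the n x n kernel matrix in row-major order; Ka caches K*a. */
#include <stddef.h>

void dcd_sweep(size_t n, const double *K, const double *y,
               double lambda, double *a, double *Ka)
{
    for (size_t i = 0; i < n; i++) {
        /* Partial gradient of the dual objective with respect to a[i]. */
        double g = Ka[i] + lambda * a[i] - y[i];
        /* Exact minimization along coordinate i. */
        double delta = -g / (K[i * n + i] + lambda);
        a[i] += delta;
        /* Keep the cached product Ka = K*a consistent (K is symmetric). */
        for (size_t j = 0; j < n; j++)
            Ka[j] += delta * K[j * n + i];
    }
}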

Paper Structure

This paper contains 16 sections, 2 theorems, 16 equations, 8 figures, 4 tables, and 4 algorithms.

Key Result

Theorem 1

Let $H$ be the number of iterations of the Block Dual Coordinate Descent (BDCD) algorithm, $b$ the block size, $P$ the number of processors, and $A \in \mathbb{R}^{m \times n}$ the input matrix, partitioned in a 1D-column layout. Under this setting, the theorem bounds the asymptotic computation and communication costs of BDCD along the critical path.
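The bounds are stated in the Hockney performance model, whose standard form is sketched here for context (the symbols below are our notation, not necessarily the paper's):

\[
T \;=\; \gamma\, F \;+\; \alpha\, L \;+\; \beta\, W,
\]

where $F$, $L$, and $W$ are the flops performed, messages sent, and words moved along the critical path, and $\gamma$, $\alpha$, $\beta$ are the per-flop, per-message, and per-word costs. Under this model the $s$-step variants reduce the latency term $\alpha L$ by a factor of $s$ at the expense of larger $F$ and $W$, matching the trade-off described in the abstract.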

Figures (8)

  • Figure 1: Comparison of DCD and $s$-step DCD convergence behavior for K-SVM-L1 and K-SVM-L2 problems.
  • Figure 2: Comparison of BDCD and $s$-step BDCD convergence behavior for K-RR problem.
  • Figure 3: Strong Scaling of DCD and $s$-step DCD for K-SVM.
  • Figure 4: Running Time Breakdown of DCD and $s$-step DCD for values of $P$ with fastest running times.
  • Figure 5: DCD and $s$-step DCD strong scaling and speedup on the news20.binary dataset for K-SVM with RBF kernel.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Proof of Theorem 1
  • Theorem 2
  • Proof of Theorem 2