DuaLip-GPU Technical Report

Gregory Dexter, Aida Rahmattalabi, Sanjana Garg, Qinquan Song, Ruby Tu, Yuan Gao, Yi Zhang, Zhipeng Wang, Rahul Mazumder

TL;DR

This report presents a redesigned solver architecture that decouples problem specification from the optimization engine and targets GPU execution. It also improves the underlying ridge-regularized dual ascent method with Jacobi-style row normalization, primal scaling, and a continuation scheme for the regularization parameter.

Abstract

Large-scale linear programs (LPs) arise in many decision systems, including ranking, allocation, and matching problems that must be solved repeatedly at massive scale. Prior work such as ECLIPSE and LinkedIn's open-source DuaLip showed that ridge-regularized dual ascent with first-order methods can scale to these settings. However, the original implementation was tightly coupled to a small number of schemas and built on a CPU-centric Scala/Spark stack, limiting extensibility and preventing effective use of modern accelerators. We present a redesigned solver architecture that decouples problem specification from the optimization engine and targets GPU execution. The system uses an operator-centric programming model in which LP formulations are expressed through composable primitives for dual objective evaluation and blockwise projection operators for decomposable constraint families. This design allows new formulations to be added locally while reusing a shared optimization loop, diagnostics, and distributed infrastructure. To realize the available parallelism, we develop GPU execution techniques tailored to sparse matching constraints, including constraint-aligned sparse layouts, batched projection kernels, and a distributed design that communicates only dual variables. Further, we improve the underlying ridge-regularized dual ascent method with Jacobi-style row normalization, primal scaling, and a continuation scheme for the regularization parameter. On extreme-scale matching workloads, the GPU implementation achieves at least a 10x wall-clock speedup over the prior distributed CPU DuaLip solver under matched stopping criteria, while maintaining convergence guarantees.
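
To make the ridge-regularized dual ascent idea concrete, the following is a minimal sketch and not the DuaLip-GPU API. It assumes the regularized problem has the form min c^T x + (gamma/2)||x||^2 subject to A x <= b with x in a simple set C, so that the primal minimizer for a fixed dual vector is a projection onto C. The box C = [0, 1]^n, the fixed step size, the use of plain projected gradient ascent in place of an accelerated method, and all function names are assumptions made for illustration.

```python
def project_box(v, lo=0.0, hi=1.0):
    """Project each coordinate onto [lo, hi] (assumed simple constraint set C)."""
    return [min(hi, max(lo, vi)) for vi in v]


def primal_from_dual(A, c, lam, gamma):
    """Minimizer of c^T x + (gamma/2)||x||^2 + lam^T (A x - b) over the box.

    The unconstrained minimizer is -(c + A^T lam) / gamma; because the
    objective is separable with equal curvature, projecting it onto the
    box gives the constrained minimizer.
    """
    n = len(c)
    At_lam = [sum(A[r][j] * lam[r] for r in range(len(A))) for j in range(n)]
    return project_box([-(c[j] + At_lam[j]) / gamma for j in range(n)])


def dual_ascent(A, b, c, gamma, step, iters):
    """Projected gradient ascent on the dual of the ridge-regularized LP."""
    lam = [0.0] * len(b)
    for _ in range(iters):
        x = primal_from_dual(A, c, lam, gamma)
        # The dual gradient is the constraint residual A x - b.
        grad = [
            sum(A[r][j] * x[j] for j in range(len(c))) - b[r]
            for r in range(len(b))
        ]
        # Ascent step followed by projection onto lam >= 0.
        lam = [max(0.0, lam[r] + step * grad[r]) for r in range(len(b))]
    return lam, primal_from_dual(A, c, lam, gamma)


if __name__ == "__main__":
    # Toy instance: maximize x1 + 2*x2 subject to x1 + x2 <= 1, 0 <= x <= 1.
    A = [[1.0, 1.0]]
    b = [1.0]
    c = [-1.0, -2.0]
    lam, x = dual_ascent(A, b, c, gamma=0.1, step=0.05, iters=500)
    print("dual:", lam, "primal:", x)
```

On this toy instance the step size 0.05 equals gamma / ||A||^2, and the primal iterate approaches x = (0, 1), the solution of the unregularized LP. Only the dual vector is carried between iterations, which is consistent with the abstract's point that a distributed design needs to communicate only dual variables.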

Paper Structure

This paper contains 30 sections, 2 theorems, 40 equations, 5 figures, 2 tables.

Key Result

Lemma 5.1

Let $\mathbf{A}=[\mathbf{A}_1~\cdots~\mathbf{A}_I]\in\mathbb{R}^{mJ\times IJ}$ with user blocks $\mathbf{A}_i\in\mathbb{R}^{mJ\times J}$ that are i.i.d. across $i$ and diagonal by rows as in Definition 1. Let $\mathbf{D}_{\mathrm{exp}}=\mathrm{diag}\!(\mathbb{E}\|\mathbf{A}_{1 then

Figures (5)

  • Figure 1: Scala–DuaLip (PyTorch) parity. Each panel shows the dual objective versus AGD iteration for the Scala and PyTorch implementations. The near-perfect overlap confirms numerical equivalence.
  • Figure 2: Relative error in dual objective compared to the Scala solver. The error drops below 1% within the first 100 iterations across all settings.
  • Figure 3: Scaling behavior across GPUs. (Left) Solve time versus number of GPUs for problem sizes between 25M and 100M sources. (Right) Speedup relative to a single GPU, approaching ideal linear scaling.
  • Figure 4: Effect of diagonal preconditioning. We report $\log(|L - \hat{L}|)$ for a 25M-source instance (10k destinations, 0.1% sparsity). Preconditioning significantly improves early-stage convergence.
  • Figure 5: Effect of regularization continuation. Decaying $\gamma$ during optimization accelerates convergence while preserving solution fidelity.
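
The two techniques evaluated in Figures 4 and 5 can be sketched together as follows. This is an illustrative sketch under stated assumptions, not the report's implementation: row normalization is taken here to mean dividing each constraint row and its right-hand side by the row's Euclidean norm, the continuation schedule is a hypothetical geometric decay of $\gamma$ with a warm-started dual vector, plain projected gradient ascent stands in for the accelerated method, and the box constraint and all function names are invented for the example.

```python
import math


def row_normalize(A, b):
    """Scale each row of A (and the matching entry of b) to unit Euclidean norm."""
    A_s, b_s, scales = [], [], []
    for row, rhs in zip(A, b):
        norm = math.sqrt(sum(v * v for v in row))
        d = 1.0 / norm if norm > 0 else 1.0
        A_s.append([d * v for v in row])
        b_s.append(d * rhs)
        scales.append(d)
    return A_s, b_s, scales


def gamma_schedule(gamma_start, gamma_end, decay):
    """Hypothetical schedule: geometric decay that finishes exactly at gamma_end."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    g = gamma_start
    while g > gamma_end:
        yield g
        g *= decay
    yield gamma_end


def primal_from_dual(A, c, lam, gamma):
    """Box-projected minimizer of the regularized Lagrangian (box is assumed)."""
    n = len(c)
    At_lam = [sum(A[r][j] * lam[r] for r in range(len(A))) for j in range(n)]
    return [min(1.0, max(0.0, -(c[j] + At_lam[j]) / gamma)) for j in range(n)]


def ascent_stage(A, b, c, lam, gamma, iters):
    """Run projected gradient ascent on the dual for one fixed gamma."""
    # Conservative step: the dual gradient is Lipschitz with constant at most
    # ||A||_F^2 / gamma, so gamma / ||A||_F^2 is a safe 1/L choice.
    step = gamma / sum(v * v for row in A for v in row)
    for _ in range(iters):
        x = primal_from_dual(A, c, lam, gamma)
        grad = [
            sum(A[r][j] * x[j] for j in range(len(c))) - b[r]
            for r in range(len(b))
        ]
        lam = [max(0.0, lam[r] + step * grad[r]) for r in range(len(b))]
    return lam


def solve_with_continuation(A, b, c, gammas, iters_per_stage):
    """Normalize rows once, then warm-start the dual across decreasing gammas."""
    gammas = list(gammas)
    A_s, b_s, scales = row_normalize(A, b)
    lam = [0.0] * len(b)
    for gamma in gammas:
        lam = ascent_stage(A_s, b_s, c, lam, gamma, iters_per_stage)
    x = primal_from_dual(A_s, c, lam, gammas[-1])
    # Multipliers of the original rows are the scaled ones times the row scale.
    lam_orig = [d * l for d, l in zip(scales, lam)]
    return lam_orig, x


if __name__ == "__main__":
    A = [[2.0, 2.0]]
    b = [2.0]
    c = [-1.0, -2.0]
    lam, x = solve_with_continuation(A, b, c, gamma_schedule(1.0, 0.1, 0.5), 200)
    print("dual:", lam, "primal:", x)
```

Because the row scales are positive, normalization leaves the feasible set unchanged and only rescales the dual variables, which is why the multipliers are mapped back at the end. Warm-starting each stage from the previous dual vector is the design choice that lets early, heavily regularized stages do most of the work cheaply.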

Theorems & Definitions (3)

  • Definition 1: Complex constraint matrix for matching problems
  • Lemma 5.1
  • Lemma A.1