GPU Implementation of Second-Order Linear and Nonlinear Programming Solvers
Alexis Montoison, François Pacaud, Sungho Shin, Mihai Anitescu
TL;DR
The paper surveys GPU-accelerated second-order optimization solvers for large, sparse problems of the form $\min_x f(x)$ s.t. $g(x) \ge 0$, with emphasis on pivoting-free interior-point methods and direct sparse solvers like cuDSS. It analyzes direct solvers, discusses LDL$^\top$ factorization and the challenges of numerical pivoting on GPUs, and explains how regularization and condensation render KKT systems SPD or SQD to enable pivoting-free solutions. It introduces MadIPM and MadNLP as GPU-native solvers and demonstrates substantial speedups (often >10x) on large-scale benchmarks (MIPLIB, PGLIB-OPF, COPS) at medium precision, while acknowledging robustness limits at high precision and outlining future directions. The work also covers algebraic modeling and AD approaches (ExaModels.jl, GPU kernels) to support fully GPU-resident workflows, highlighting open challenges in stability, batching, and hardware portability for broader practical impact.
Abstract
In recent years, GPU-accelerated optimization solvers based on second-order methods (e.g., interior-point methods) have gained momentum with the advent of mature and efficient GPU-accelerated direct sparse linear solvers, such as cuDSS. This paper provides an overview of the state of the art in GPU-based second-order solvers, focusing on pivoting-free interior-point methods for large and sparse linear and nonlinear programs. We begin by highlighting the capabilities and limitations of the currently available GPU-accelerated sparse linear solvers. Next, we discuss different formulations of the Karush-Kuhn-Tucker systems for second-order methods and evaluate their suitability for pivoting-free GPU implementations. We also discuss strategies for computing sparse Jacobians and Hessians on GPUs for nonlinear programming. Finally, we present numerical experiments demonstrating the scalability of GPU-based optimization solvers. We observe speedups often exceeding 10x compared to comparable CPU implementations on large-scale instances when solved up to medium precision. Additionally, we examine the current limitations of existing approaches.
