Iterative Methods in GPU-Resident Linear Solvers for Nonlinear Constrained Optimization

Kasia Świrydowicz, Nicholson Koukpaizan, Maksudul Alam, Shaked Regev, Michael Saunders, Slaven Peleš

TL;DR

The paper tackles the bottleneck of solving ill-conditioned linear systems within nonlinear constrained optimization on heterogeneous hardware. It introduces a GPU-aware strategy that couples LU-based refactorization with iterative refinement via FGMRES to accelerate the solution of KKT systems, while reusing data structures to minimize data movement. Empirical results show substantial performance gains over CPU baselines, especially when iterative refinement is tuned and integrated with full optimization stacks such as ExaGO™/HiOp, and the approach compares favorably to HyKKT on large-scale problems. The work also demonstrates the value of standalone testing for predicting application-level performance and offers recommendations for future codesign between optimization solvers and linear algebra routines to exploit GPUs effectively.

Abstract

Linear solvers are major computational bottlenecks in a wide range of decision support and optimization computations. The challenges become even more pronounced on heterogeneous hardware, where traditional sparse numerical linear algebra methods are often inefficient. For example, methods for solving ill-conditioned linear systems have relied on conditional branching, which degrades performance on hardware accelerators such as graphics processing units (GPUs). To improve the efficiency of solving ill-conditioned systems, our computational strategy separates computations that are efficient on GPUs from those that need to run on traditional central processing units (CPUs). Our strategy maximizes the reuse of expensive CPU computations. Iterative methods, which thus far have not been broadly used for ill-conditioned linear systems, play an important role in our approach. In particular, we extend ideas from [1] to implement iterative refinement using inexact LU factors and flexible generalized minimal residual (FGMRES), with the aim of efficient performance on GPUs. We focus on solutions that are effective within broader application contexts, and discuss how early performance tests could be improved to be more predictive of the performance in a realistic environment.
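
The idea of iterative refinement with inexact LU factors and FGMRES can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and it rests on several assumptions: the helper names (`lu_nopivot`, `lu_apply`, `fgmres`) are hypothetical; matrices are small and dense rather than sparse; and the inexact factors are emulated by factoring a nearby matrix `A_old` once and reusing those factors for a perturbed matrix `A_new`, which only stands in for GPU refactorization. The restart length of 20 and the tolerance of 1e-14 mirror the values quoted in the figure captions.

```python
import numpy as np

def lu_nopivot(A):
    """Compact LU without pivoting: unit-lower L below the diagonal, U on and above it."""
    LU = np.array(A, dtype=float)
    n = LU.shape[0]
    for k in range(n - 1):
        LU[k + 1:, k] /= LU[k, k]
        LU[k + 1:, k + 1:] -= np.outer(LU[k + 1:, k], LU[k, k + 1:])
    return LU

def lu_apply(LU, v):
    """Solve (L U) z = v with one forward and one backward triangular solve."""
    n = LU.shape[0]
    z = np.array(v, dtype=float)
    for i in range(1, n):                      # forward substitution (unit diagonal)
        z[i] -= LU[i, :i] @ z[:i]
    for i in range(n - 1, -1, -1):             # backward substitution
        z[i] = (z[i] - LU[i, i + 1:] @ z[i + 1:]) / LU[i, i]
    return z

def fgmres(A, b, precond, restart=20, tol=1e-14, max_restarts=5):
    """Restarted flexible GMRES; `precond` applies the (inexact) inverse to a vector."""
    n = b.size
    x = np.zeros(n)
    b_norm = np.linalg.norm(b)
    if b_norm == 0.0:
        return x
    m = min(restart, n)
    for _ in range(max_restarts):
        r = b - A @ x
        beta = np.linalg.norm(r)
        if beta <= tol * b_norm:
            break
        V = np.zeros((n, m + 1))               # orthonormal Krylov basis
        Z = np.zeros((n, m))                   # preconditioned directions (kept in FGMRES)
        H = np.zeros((m + 1, m))               # upper Hessenberg matrix
        e1 = np.zeros(m + 1)
        e1[0] = beta
        V[:, 0] = r / beta
        for k in range(m):
            Z[:, k] = precond(V[:, k])
            w = A @ Z[:, k]
            for i in range(k + 1):             # modified Gram-Schmidt
                H[i, k] = V[:, i] @ w
                w = w - H[i, k] * V[:, i]
            H[k + 1, k] = np.linalg.norm(w)
            # Small least-squares problem: min || beta*e1 - H y ||
            y = np.linalg.lstsq(H[:k + 2, :k + 1], e1[:k + 2], rcond=None)[0]
            res = np.linalg.norm(e1[:k + 2] - H[:k + 2, :k + 1] @ y)
            if res <= tol * b_norm or H[k + 1, k] <= np.finfo(float).eps * beta:
                break
            V[:, k + 1] = w / H[k + 1, k]
        x = x + Z[:, :k + 1] @ y               # update from the preconditioned directions
    return x

# Assumed toy problem: a diagonally dominant matrix and a perturbed successor.
rng = np.random.default_rng(0)
n = 60
A_old = rng.standard_normal((n, n)) + 2.0 * n * np.eye(n)
A_new = A_old + 0.5 * rng.standard_normal((n, n))
b = rng.standard_normal(n)

LU_old = lu_nopivot(A_old)                     # expensive step, done once and reused

x_stale = lu_apply(LU_old, b)                  # stale factors only, no refinement
x_refined = fgmres(A_new, b, lambda v: lu_apply(LU_old, v))

rel_stale = np.linalg.norm(b - A_new @ x_stale) / np.linalg.norm(b)
rel_refined = np.linalg.norm(b - A_new @ x_refined) / np.linalg.norm(b)
print(f"relative residual, stale factors only: {rel_stale:.2e}")
print(f"relative residual, with FGMRES:        {rel_refined:.2e}")
```

In this sketch each FGMRES iteration costs one matrix-vector product and one pair of triangular solves with the reused factors, so the factorization itself is never repeated. The flexible variant stores the preconditioned vectors `Z` explicitly, which would also allow the preconditioner to change between iterations.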

Paper Structure (21 sections, 12 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Norm of scaled residuals computed with different solver strategies for the largest standalone test case series. Refactorization without iterative refinement (cyan squares) produces solutions of insufficient quality. The refactorization approach with our iterative refinement (yellow squares) delivers solution quality comparable to that of solvers that perform full numerical factorization for each system, and better quality than iterative refinement implemented in cusolverGLU (red diamonds).
  • Figure 2: Timing results for cusolverRf with and without FGMRES iterative refinement for the largest standalone test case series (green circles and red diamonds, respectively). The tolerance for FGMRES was $10^{-14}$. The iterative refinement adds non-negligible overhead, but the computation still outperforms the MA57 baseline.
  • Figure 3: Our iterative refinement approach (green triangles) performs on par with Richardson-style iterations (magenta circles) and reduces the norm of scaled residuals to below machine precision (left figure), while requiring fewer triangular solves overall (right). Note: outlier cases have been removed. FGMRES(20) was used in the numerical experiments.
  • Figure 4: Comparison of the cost of each matrix factorization when different linear solvers are nominally used for ACOPF on OLCF Summit. Each GPU factorization (GLU in red and Rf in green) outperforms each CPU factorization (MA57 in blue).
  • Figure 5: Comparison of the average computational cost per iteration of the most expensive operations when different linear solvers are used for ACOPF on OLCF Summit. The cost of the first step, which is executed on the CPU, is accounted for in the averages.
  • ...and 3 more figures