Table of Contents
Fetching ...

Cyqlone: A Parallel, High-Performance Linear Solver for Optimal Control

Pieter Pas, Panagiotis Patrinos

TL;DR

Cyqlone introduces a highly parallel, high-performance solver for linear-quadratic OCPs by unifying modified Riccati recursion, parallel Schur-complement techniques, and cyclic reduction. It partitions the horizon across P processors to achieve near-ideal scaling, while leveraging batch-wise vectorization to accelerate small-matrix operations beyond traditional libraries. The companion CyQPALM exploits Cyqlone as the linear solver within a proximal augmented Lagrangian framework, delivering large speedups over state-of-the-art solvers, especially for long horizons and warm-started MPC-like tasks. Together, these methods enable real-time solutions for long-horizon OCPs on modern multi-core hardware and provide open-source implementations for practitioners.

Abstract

We present Cyqlone, a solver for linear systems with a stage-wise optimal control structure that fully exploits the various levels of parallelism available in modern hardware. Cyqlone unifies algorithms based on the sequential Riccati recursion, parallel Schur complement methods, and cyclic reduction methods, thereby minimizing the required number of floating-point operations, while allowing parallelization across a user-configurable number of processors. Given sufficient parallelism, the solver run time scales with the logarithm of the horizon length (in contrast to the linear scaling of sequential Riccati-based methods), enabling real-time solution of long-horizon problems. Beyond multithreading on multi-core processors, implementations of Cyqlone can also leverage vectorization using batched linear algebra routines. Such batched routines exploit data parallelism using single instruction, multiple data (SIMD) operations, and expose a higher degree of instruction-level parallelism than their non-batched counterparts. This enables them to significantly outperform BLAS and BLASFEO for the small matrices that arise in optimal control. Building on this high-performance linear solver, we develop CyQPALM, a parallel and optimal-control-specific variant of the QPALM quadratic programming solver. It combines the parallel and vectorized linear algebra operations from Cyqlone with a parallel line search and parallel factorization updates, resulting in order-of-magnitude speedups compared to the state-of-the-art HPIPM solver. Open-source C++ implementations of Cyqlone and CyQPALM are available at https://github.com/kul-optec/cyqlone

Cyqlone: A Parallel, High-Performance Linear Solver for Optimal Control

TL;DR

Cyqlone introduces a highly parallel, high-performance solver for linear-quadratic OCPs by unifying modified Riccati recursion, parallel Schur-complement techniques, and cyclic reduction. It partitions the horizon across P processors to achieve near-ideal scaling, while leveraging batch-wise vectorization to accelerate small-matrix operations beyond traditional libraries. The companion CyQPALM exploits Cyqlone as the linear solver within a proximal augmented Lagrangian framework, delivering large speedups over state-of-the-art solvers, especially for long horizons and warm-started MPC-like tasks. Together, these methods enable real-time solutions for long-horizon OCPs on modern multi-core hardware and provide open-source implementations for practitioners.

Abstract

We present Cyqlone, a solver for linear systems with a stage-wise optimal control structure that fully exploits the various levels of parallelism available in modern hardware. Cyqlone unifies algorithms based on the sequential Riccati recursion, parallel Schur complement methods, and cyclic reduction methods, thereby minimizing the required number of floating-point operations, while allowing parallelization across a user-configurable number of processors. Given sufficient parallelism, the solver run time scales with the logarithm of the horizon length (in contrast to the linear scaling of sequential Riccati-based methods), enabling real-time solution of long-horizon problems. Beyond multithreading on multi-core processors, implementations of Cyqlone can also leverage vectorization using batched linear algebra routines. Such batched routines exploit data parallelism using single instruction, multiple data (SIMD) operations, and expose a higher degree of instruction-level parallelism than their non-batched counterparts. This enables them to significantly outperform BLAS and BLASFEO for the small matrices that arise in optimal control. Building on this high-performance linear solver, we develop CyQPALM, a parallel and optimal-control-specific variant of the QPALM quadratic programming solver. It combines the parallel and vectorized linear algebra operations from Cyqlone with a parallel line search and parallel factorization updates, resulting in order-of-magnitude speedups compared to the state-of-the-art HPIPM solver. Open-source C++ implementations of Cyqlone and CyQPALM are available at https://github.com/kul-optec/cyqlone

Paper Structure

This paper contains 51 sections, 50 equations, 23 figures, 4 tables.

Figures (23)

  • Figure 1: Performance of single-threaded and multithreaded Cholesky factorization (potrf) routines from the Intel MKL, on an Intel Core Ultra 7 265. The multithreaded variant becomes more performant than the single-threaded variant for matrices of size $192\times 192$ and larger.
  • Figure 2: Performance of different (fused) symmetric rank-$k$ update (syrk) and Cholesky factorization (potrf) routines frison_algorithms_2016 when applied to batches of eight square matrices $A$ and $C$. The Batmat implementation using batch-wise vectorization tttapa_batmat_2025 significantly outperforms the BLASFEO and Intel MKL BLAS implementations for small matrices. Unlike Batmat and BLASFEO, the Intel MKL does not provide a fused syrk+potrf implementation, so the individual syrk and potrf routines are used instead. Experiments were carried out on an Intel Core Ultra 7 265 with a sustained clock speed of 5.2 GHz and a theoretical peak AVX2 throughput of 41.6 GFLOPS, using BLASFEO 0.1.4.2 and version 2025.0.1 of the Intel MKL.
  • Figure 3: Visualization of steps \ref{['it:isolate-H11']}--\ref{['it:update-schur-compl']} of the block Cholesky factorization of a sparse symmetric matrix $H$ as in \ref{['eq:blk-chol']}. Nonzero blocks of $H$ are indicated by crosses ($\times$). Blocks that are modified by the elimination of $H_{11}$ and $H_{21}$ are shown in bold, and blocks of fill-in in the Schur complement are shown in red.
  • Figure 4: Top: a $16\times 16$ symmetric block-tridiagonal matrix with its block Cholesky factor and the corresponding elimination tree. No fill-in is incurred, but the single linear path in the elimination tree highlights the sequential nature of this factorization procedure. Bottom: cyclic reduction as the factorization of a permutation of the same block-tridiagonal matrix, along with its Cholesky factor (with fill-in generated during the factorization shown in gray and labels matching the derivation in \ref{['subsec:cr-elim-deriv']}). Blocks in the square outlines along the diagonal can be factorized simultaneously. The parallel branches in the elimination tree visualize the available parallelism during the Cholesky factorization of the permuted matrix.
  • Figure 5: Cyqlone ordering of the coefficient matrix of a KKT system with optimal control structure \ref{['eq:opt-cond-ocp-eq']} with horizon $N=12$ for $P=4$ processors. Grey blocks indicate fill-in incurred during the factorization of the matrix. Stars ($\star$) mark structured or redundant fill-in (see \ref{['sec:mod-ricc']}). The corresponding elimination tree visualizes the available parallelism during the factorization.
  • ...and 18 more figures