Self-Supervised Learning of Iterative Solvers for Constrained Optimization

Lukas Lüken; Sergio Lucia

TL;DR

The paper addresses the real-time, high-accuracy solution of parametric constrained optimization problems, particularly in model predictive control, where traditional solvers struggle to meet tight time budgets. It introduces LISCO, a two-stage learning-based solver: a predictor produces warm-start primal-dual estimates, and a solver iteratively refines them with update steps guided by a differentiable KKT residual. Both stages are trained in a fully self-supervised fashion through a KKT-based loss, so no pre-sampled optimizer solutions are needed. A convexification strategy extends the method to nonconvex problems while preserving the guarantee that minima of the per-sample loss coincide with KKT points. Empirical results on NMPC with a nonlinear double integrator and on a high-dimensional nonconvex QP show substantial online speedups over IPOPT and higher accuracy than competing learning-based baselines such as DC3 and PDL, highlighting LISCO’s potential for real-time certified optimization in complex control tasks.
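The self-supervised idea summarized above can be illustrated with a minimal sketch. Everything in it is an assumption for illustration and not the paper's setup: the toy equality-constrained problem, the linear stand-in `predictor`, and plain gradient descent as the training method. The point is only that the loss is the KKT residual itself, so no solver-generated labels are needed.

```python
import random

# Toy parametric problem (assumed for illustration, not from the paper):
#   min_x 0.5 * (x1^2 + x2^2)   s.t.   x1 + x2 = p
# For z = (x1, x2, nu) the KKT residual is linear in z.
def kkt_residual(z, p):
    x1, x2, nu = z
    return [x1 + nu, x2 + nu, x1 + x2 - p]

def predictor(w, p):
    # Hypothetical linear stand-in for a predictor network: z0 = w * p.
    return [wi * p for wi in w]

def loss_and_grad(w, params):
    """Mean squared KKT residual over sampled parameters, and its gradient in w."""
    loss, grad = 0.0, [0.0, 0.0, 0.0]
    for p in params:
        r = kkt_residual(predictor(w, p), p)
        loss += sum(v * v for v in r)
        # Chain rule: dr/dw = p * J, with J the Jacobian of r with respect to z.
        grad[0] += 2.0 * p * (r[0] + r[2])
        grad[1] += 2.0 * p * (r[1] + r[2])
        grad[2] += 2.0 * p * (r[0] + r[1])
    n = len(params)
    return loss / n, [g / n for g in grad]

def train(steps=500, lr=0.1, n_samples=64, seed=0):
    rng = random.Random(seed)
    # Only parameters are sampled; no optimizer solutions are used as labels.
    params = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        _, g = loss_and_grad(w, params)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    final_loss, _ = loss_and_grad(w, params)
    return w, final_loss
```

For this toy problem the exact solution is x1 = x2 = p/2 with multiplier nu = -p/2, so training should drive `w` toward (0.5, 0.5, -0.5) without ever calling a solver.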

Abstract

The real-time solution of parametric optimization problems is critical for applications that demand high accuracy under tight real-time constraints, such as model predictive control. To this end, this work presents a learning-based iterative solver for constrained optimization, comprising a neural network predictor that generates initial primal-dual solution estimates, followed by a learned iterative solver that refines these estimates to reach high accuracy. We introduce a novel loss function based on Karush-Kuhn-Tucker (KKT) optimality conditions, enabling fully self-supervised training without pre-sampled optimizer solutions. Theoretical guarantees ensure that the training loss function attains minima exclusively at KKT points. A convexification procedure enables application to nonconvex problems while preserving these guarantees. Experiments on two nonconvex case studies demonstrate speedups of up to one order of magnitude compared to state-of-the-art solvers such as IPOPT, while achieving orders of magnitude higher accuracy than competing learning-based approaches.
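The predict-then-refine structure described in the abstract can be sketched as a short loop. This is a sketch under stated assumptions: the toy problem, the hand-written `predictor`, and the gradient-type `solver_step` are hypothetical stand-ins for the trained networks, and the stopping rule on the infinity norm of the KKT residual with tolerance 1e-6 mirrors the one quoted in the figure captions.

```python
# Toy parametric problem (assumed for illustration, not from the paper):
#   min_x 0.5 * (x1^2 + x2^2)   s.t.   x1 + x2 = p
def kkt_residual(z, p):
    x1, x2, nu = z
    return [x1 + nu, x2 + nu, x1 + x2 - p]

def predictor(p):
    # Hypothetical stand-in for the predictor network: a rough initial guess.
    return [0.4 * p, 0.4 * p, -0.4 * p]

def solver_step(z, p, alpha=0.4):
    # Hypothetical stand-in for the solver network: a damped step along
    # -J^T r, where J is the (constant) Jacobian of the residual in z.
    r = kkt_residual(z, p)
    return [-alpha * (r[0] + r[2]),
            -alpha * (r[1] + r[2]),
            -alpha * (r[0] + r[1])]

def solve(p, z0, tol=1e-6, max_iter=200):
    """Refine z0 until the KKT residual infinity norm drops below tol."""
    z = list(z0)
    for k in range(max_iter):
        if max(abs(v) for v in kkt_residual(z, p)) < tol:
            return z, k
        z = [zi + dzi for zi, dzi in zip(z, solver_step(z, p))]
    return z, max_iter

# Usage: compare a warm start from the predictor with a cold start at zero.
z_warm, k_warm = solve(2.0, predictor(2.0))
z_cold, k_cold = solve(2.0, [0.0, 0.0, 0.0])
```

In this toy setting the warm start begins closer to the solution and therefore needs no more iterations than the cold start, which is the role the predictor plays in the two-stage design.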

Paper Structure (19 sections, 2 theorems, 45 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Background
Parametric Nonlinear Constrained Optimization
Optimality Condition Reformulation
Nonlinear Model Predictive Control
Proposed Method
Learning-Based Iterative Solver for Constrained Optimization (LISCO)
Predictor
Solver
A Novel Loss Function for Self-Supervised Training
Theoretical Properties of the Loss Function
Implementation Details
Solver Network Architecture
Convexification Procedure
Self-Supervised Training
...and 4 more sections

Key Result

Lemma 1

Given Assumptions 1 to 5, the per-sample loss function (eq:loss_function_per_sample) has no local optima other than the points $(\hat{\mathbf{z}}^{i}, \mathbf{p}^{i})$ that satisfy the modified KKT conditions (eq:r_phi). This means that the derivative of the loss function …
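For orientation, the standard KKT conditions and one common smooth reformulation of the complementarity conditions are written out below. The problem form, the symbols $f$, $\mathbf{g}$, $\mathbf{h}$, $\boldsymbol{\lambda}$, $\boldsymbol{\nu}$, and the choice of the Fischer-Burmeister function $\phi$ are assumptions made for illustration; the paper's modified conditions (eq:r_phi) may be defined differently.

```latex
% Generic parametric nonlinear program (assumed form)
\begin{align*}
\min_{\mathbf{x}} \; f(\mathbf{x}, \mathbf{p})
\quad \text{s.t.} \quad
\mathbf{h}(\mathbf{x}, \mathbf{p}) = \mathbf{0}, \quad
\mathbf{g}(\mathbf{x}, \mathbf{p}) \le \mathbf{0}
\end{align*}

% Lagrangian and KKT conditions
\begin{align*}
\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}, \mathbf{p})
  &= f(\mathbf{x}, \mathbf{p})
   + \boldsymbol{\nu}^{\top} \mathbf{h}(\mathbf{x}, \mathbf{p})
   + \boldsymbol{\lambda}^{\top} \mathbf{g}(\mathbf{x}, \mathbf{p}) \\
\nabla_{\mathbf{x}} \mathcal{L} &= \mathbf{0}, \qquad
\mathbf{h}(\mathbf{x}, \mathbf{p}) = \mathbf{0}, \\
\mathbf{g}(\mathbf{x}, \mathbf{p}) &\le \mathbf{0}, \qquad
\boldsymbol{\lambda} \ge \mathbf{0}, \qquad
\lambda_i \, g_i(\mathbf{x}, \mathbf{p}) = 0 \;\; \forall i
\end{align*}

% Fischer-Burmeister function: turns the inequality and complementarity
% conditions into equations
\begin{align*}
\phi(a, b) &= a + b - \sqrt{a^{2} + b^{2}}, \\
\phi(a, b) = 0 \;&\Longleftrightarrow\; a \ge 0, \;\; b \ge 0, \;\; a b = 0
\end{align*}
```

With $a = \lambda_i$ and $b = -g_i(\mathbf{x}, \mathbf{p})$, the three conditions on each inequality constraint reduce to the single equation $\phi(\lambda_i, -g_i) = 0$, so all KKT conditions can be stacked into one residual vector whose norm can serve as a loss.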

Figures (6)

Figure 1: Overview of the LISCO architecture, consisting of a predictor network that provides an initial primal-dual estimate $\hat{\mathbf{z}}_0$ based on the problem parameters $\mathbf{p}$ and a solver network that iteratively refines this estimate by predicting update steps $\Delta \hat{\mathbf{z}}_k$ based on the current primal-dual iterate $\hat{\mathbf{z}}_k$ and the problem parameters $\mathbf{p}$. Both networks are trained in a self-supervised manner using a novel loss function based on the KKT conditions.
Figure 2: Architecture of the solver network (detailed view from Fig. \ref{fig:solver_figure}). The neural network $\Psi_{\theta}$ takes normalized KKT residuals $\boldsymbol{\tau}_k$ and problem parameters $\mathbf{p}$ as inputs, where $\boldsymbol{\tau}_k$ contains both the violation direction and magnitude information (equation \ref{eq:tau}). The network output is scaled by the KKT residual 2-norm $\|\mathbf{r}_k\|_2$ and a problem-specific factor $\gamma$ to produce steps $\Delta \hat{\mathbf{z}}_k$ (equation \ref{eq:solver_step}).
Figure 3: Convergence of KKT residuals (\ref{eq:kkt_conditions}) over solver iterations for the nonlinear MPC problem on a test dataset of $N_{\text{test}}=5000$ parameter instances $\mathbf{p}^{i}$. The predictor network is used to initialize the solver, which then refines the solution iteratively. For each iteration $k$, the figure shows percentiles of the KKT residual infinity norm across all test instances: 50th percentile (median), 90th, 95th, 99th percentiles, and the maximum value. A tolerance of $10^{-6}$ is used to determine convergence.
Figure 4: Histograms of speedup factors achieved by LISCO compared to IPOPT for the NMPC problem ($N_{\text{test}} = 5000$). The figure shows two distributions: LISCO with predictor (blue) and LISCO without predictor (orange). The histograms display the distribution of runtime ratios (IPOPT time / LISCO time), where values greater than 1 indicate that LISCO is faster than IPOPT. A tolerance of $10^{-6}$ on the KKT residual infinity norm is used for all methods. The percentage of runs where LISCO is faster than IPOPT is indicated in the legend.
Figure 5: Convergence of KKT residuals (\ref{eq:kkt_conditions}) over solver iterations for the nonconvex QP problem, aggregated across 5 independent problem instances with $N_{\text{test}}=1000$ parameter instances $\mathbf{p}^{i}$ each. The predictor network is used to initialize the solver, which then refines the solution iteratively. For each iteration $k$, the solid lines show the median values across the 5 problem instances for different percentiles of the KKT residual infinity norm: 50th percentile (median), 90th, 95th, 99th percentiles, and the maximum value. The shaded areas indicate the range from minimum to maximum values across the 5 problem instances. A tolerance of $10^{-6}$ is used to determine convergence.
...and 1 more figure
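The step scaling described in the Figure 2 caption can be sketched as follows. The exact definition of $\boldsymbol{\tau}_k$ (equation eq:tau) is not reproduced on this page, so the normalization used here (unit direction plus a log-magnitude entry) is an assumption, as are the function names and the callable `psi` standing in for the network $\Psi_{\theta}$. Only the final scaling by $\gamma \|\mathbf{r}_k\|_2$ is taken from the caption.

```python
import math

def normalized_residual(r, eps=1e-12):
    """Split a residual into direction and magnitude features.

    Assumed form for illustration: unit-norm direction followed by
    log10 of the 2-norm. The paper's definition of tau may differ.
    """
    norm = math.sqrt(sum(v * v for v in r))
    direction = [v / (norm + eps) for v in r]
    magnitude = math.log10(norm + eps)
    return direction + [magnitude], norm

def solver_step(psi, r, p, gamma):
    """Compute a step as gamma * ||r||_2 * psi(tau, p).

    psi is a hypothetical callable standing in for the solver network.
    """
    tau, norm = normalized_residual(r)
    raw = psi(tau, p)
    return [gamma * norm * v for v in raw]

# Usage with a trivial stand-in network that steps against the residual direction.
psi = lambda tau, p: [-t for t in tau[:-1]]
step = solver_step(psi, [3.0, 4.0], [], gamma=0.5)
```

One consequence of this scaling is that the step length shrinks in proportion to the residual norm, so the iterates stop moving once the KKT residual reaches zero, regardless of what the network outputs.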

Theorems & Definitions (4)

Lemma 1
proof
Theorem 1
proof
