DiffOP: Reinforcement Learning of Optimization-Based Control Policies via Implicit Policy Gradients

Yuexin Bian, Jie Feng, Yuanyuan Shi

TL;DR

DiffOP learns control policies that are defined implicitly by optimization problems, and it avoids value-function approximation by differentiating directly through the optimization layer. It jointly learns cost and dynamics models and derives analytical policy gradients via implicit differentiation and Pontryagin’s Maximum Principle, enabling end-to-end RL with model-based components. The authors prove non-asymptotic convergence to an $\epsilon$-stationary point in $\mathcal{O}(\epsilon^{-1})$ iterations and validate DiffOP on nonlinear control tasks and constrained power-system voltage control, where it outperforms RL-based MPC and differentiable-optimization baselines. Code is publicly available.

Abstract

Real-world control systems require policies that are not only high-performing but also interpretable and robust. A promising direction toward this goal is model-based control, which learns system dynamics and cost functions from historical data and then uses these models to inform decision-making. Building on this paradigm, we introduce DiffOP, a novel framework for learning optimization-based control policies defined implicitly through optimization control problems. Without relying on value function approximation, DiffOP jointly learns the cost and dynamics models and directly optimizes the actual control costs using policy gradients. To enable this, we derive analytical policy gradients by applying implicit differentiation to the underlying optimization problem and integrating it with the standard policy gradient framework. Under standard regularity conditions, we establish that DiffOP converges to an $\epsilon$-stationary point within $\mathcal{O}(\epsilon^{-1})$ iterations. We demonstrate the effectiveness of DiffOP through experiments on nonlinear control tasks and power system voltage control with constraints. The code is available at https://github.com/alwaysbyx/DiffOP.
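
To make the idea of "optimizing the actual control cost with policy gradients through an optimization-based policy" concrete, here is a minimal sketch on a hypothetical one-step scalar problem. Everything in it is an illustrative assumption rather than the authors' implementation: the system constants, the single learnable parameter `theta` (the model's input gain), the Gaussian exploration noise, and the baseline. The policy solves a small optimal control problem under its own (wrong) model, the environment returns the true cost, and the score-function gradient is chained with the derivative of the optimizer's solution.

```python
import random

# Hypothetical scalar setup (illustrative assumptions, not from the paper).
A_TRUE, B_TRUE = 1.0, 0.5   # true dynamics x' = A_TRUE*x + B_TRUE*u, hidden from the policy
A_MODEL = 1.0               # model's state coefficient, held fixed in this sketch
Q, R = 1.0, 0.1             # cost weights: Q*x'^2 + R*u^2
X0 = 1.0                    # initial state


def policy_action(theta, x):
    """u* = argmin_u R*u^2 + Q*(A_MODEL*x + theta*u)^2 (closed form for this toy problem)."""
    return -Q * A_MODEL * theta * x / (R + Q * theta ** 2)


def policy_action_grad(theta, x):
    """du*/dtheta from the implicit function theorem on the stationarity condition F(u, theta) = 0."""
    u = policy_action(theta, x)
    F_u = 2 * R + 2 * Q * theta ** 2                  # dF/du
    F_theta = 2 * Q * (A_MODEL * x + 2 * theta * u)   # dF/dtheta
    return -F_theta / F_u


def actual_cost(x, u):
    """Cost returned by the true environment (uses the true dynamics)."""
    x_next = A_TRUE * x + B_TRUE * u
    return Q * x_next ** 2 + R * u ** 2


def train(theta=2.0, lr=0.5, sigma=0.1, batch=64, iters=300, seed=0):
    rng = random.Random(seed)
    for _ in range(iters):
        u_star = policy_action(theta, X0)
        du_dtheta = policy_action_grad(theta, X0)
        baseline = actual_cost(X0, u_star)  # variance reduction; costs one extra rollout
        grad = 0.0
        for _ in range(batch):
            eps = rng.gauss(0.0, 1.0)
            u = u_star + sigma * eps        # Gaussian exploration around the optimizer's output
            # Score of N(u*, sigma^2) w.r.t. u* is (u - u*)/sigma^2 = eps/sigma,
            # chained with du*/dtheta from implicit differentiation.
            grad += (actual_cost(X0, u) - baseline) * (eps / sigma) * du_dtheta
        theta -= lr * grad / batch
    return theta
```

Calling `train()` starts from a deliberately wrong input gain and adjusts it using only observed costs; the learned `theta` need not equal the true gain, it only needs to make the optimizer produce low-cost actions.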

Paper Structure

This paper contains 22 sections, 6 theorems, 71 equations, 5 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

Suppose $u^\star$ is the solution of the optimization-based policy (eq:op_policy), and let $\zeta^\star$ denote the resulting trajectory. Assume $c(\cdot), c_H(\cdot), f(\cdot), g(\cdot)$ are twice differentiable in a neighborhood of $(\theta, \zeta^\star)$. Let … If $\text{rank}(A)=n_\kappa$ and $D$ is non-singular, then the gradient $\nabla_{\theta} u_i^\star$ takes the following form, … with … where …
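
The general mechanism behind such a result, differentiating the optimality (KKT) conditions of the control problem with respect to $\theta$, can be illustrated on a tiny equality-constrained problem. The sketch below is an assumption-laden toy, not the proposition's formula: it uses a hypothetical one-step problem $\min_{x_1,u}\ \tfrac{1}{2}q x_1^2 + \tfrac{1}{2}r u^2$ subject to $x_1 = a x_0 + \theta u$, does not reproduce the matrices $A$ and $D$, and relies on a small hand-written linear solver.

```python
def solve(K, b):
    """Solve K z = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(K)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def kkt_solution(theta, x0=1.0, a=1.0, q=1.0, r=0.1):
    """Return the KKT matrix and the primal-dual solution z = (x1, u, lam).

    KKT conditions of the toy problem (lam is the multiplier of the dynamics constraint):
        q*x1 - lam = 0,   r*u + lam*theta = 0,   -x1 + theta*u = -a*x0
    """
    K = [[q, 0.0, -1.0],
         [0.0, r, theta],
         [-1.0, theta, 0.0]]
    return K, solve(K, [0.0, 0.0, -a * x0])


def implicit_gradient(theta, **kw):
    """d(x1, u, lam)/dtheta, obtained by differentiating the KKT system K z = b.

    With z held fixed, the KKT residual changes with theta by (0, lam, u),
    so the sensitivities satisfy K dz = -(0, lam, u).
    """
    K, (x1, u, lam) = kkt_solution(theta, **kw)
    return solve(K, [0.0, -lam, -u])


def finite_difference_du(theta, h=1e-6):
    """Central-difference check of du*/dtheta."""
    u_plus = kkt_solution(theta + h)[1][1]
    u_minus = kkt_solution(theta - h)[1][1]
    return (u_plus - u_minus) / (2 * h)
```

The same KKT matrix is reused for the forward solve and for the sensitivity solve, which is why the non-singularity condition matters: the gradient exists only when that linear system has a unique solution. `implicit_gradient(2.0)[1]` and `finite_difference_du(2.0)` should agree closely.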

Figures (5)

  • Figure 1: Overview of the DiffOP framework. An optimization-based control policy generates an action sequence by solving a parameterized optimal control problem with learnable dynamics and cost models. The environment executes the action sequence and returns cost feedback, which is used to compute policy gradients for updating the parameters.
  • Figure 2: Control cost versus training iteration on nonlinear control tasks. Solid lines indicate the mean cost over 5 runs, and shaded regions denote the 20th to 80th percentile range.
  • Figure : (a) DiffOP(Traj)
  • Figure : (a) DiffOP(Traj)
  • Figure : (b) DiffOP(Step)

Theorems & Definitions (15)

  • Proposition 1: Gradient of the optimization-based policy
  • Proposition 2: Policy gradient update
  • proof
  • Theorem 1: Convergence of DiffOP with Policy Gradient
  • proof
  • Proposition 3
  • proof
  • proof
  • proof
  • Lemma 1: Smoothness of $C(\theta)$
  • ...and 5 more
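
For intuition about how a smoothness result such as Lemma 1 leads to an $\mathcal{O}(\epsilon^{-1})$ iteration count, the standard descent-lemma argument is sketched below. It assumes that $C$ is $L$-smooth and bounded below by $C^\star$, that exact gradients are used with step size $\eta = 1/L$, and that stationarity is measured by the squared gradient norm; the paper's exact assumptions, constants, and stationarity measure may differ.

$$
\begin{aligned}
C(\theta_{t+1}) &\le C(\theta_t) - \frac{1}{2L}\,\|\nabla C(\theta_t)\|^2
&& \text{(one step of } \theta_{t+1} = \theta_t - \tfrac{1}{L}\nabla C(\theta_t)\text{)} \\
\min_{0 \le t < T} \|\nabla C(\theta_t)\|^2 &\le \frac{1}{T}\sum_{t=0}^{T-1} \|\nabla C(\theta_t)\|^2
\le \frac{2L\,\bigl(C(\theta_0) - C^\star\bigr)}{T}
&& \text{(sum over } t \text{ and telescope)}
\end{aligned}
$$

Under these assumptions, $T \ge 2L\,(C(\theta_0) - C^\star)/\epsilon$ iterations suffice for some iterate to satisfy $\|\nabla C(\theta_t)\|^2 \le \epsilon$.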