Efficiently Training Deep-Learning Parametric Policies using Lagrangian Duality

Andrew Rosemberg, Alexandre Street, Davi M. Valladão, Pascal Van Hentenryck

TL;DR

The paper tackles constrained Markov decision processes with hard nonlinear constraints by introducing Two-Stage General Decision Rules (TS-GDR) and its deep policy instantiation TS-DDR, trained via forward deterministic optimization and backward dual-gradient updates. TS-DDR learns non-linear, time-invariant policies that can be evaluated in real time, leveraging duality to obtain closed-form gradients without implicit differentiation. The authors apply TS-DDR to the Long-Term Hydrothermal Dispatch problem under multiple power-flow formulations (DCLL, SOC, AC) using Bolivian system data, showing orders-of-magnitude improvements in training and inference times and superior solution quality, particularly in non-convex settings where traditional SDDP or model-free RL falter. The results demonstrate the practical impact of a hybrid ML/stochastic-optimization approach for high-stakes, sequential decision tasks and motivate broader exploration of TS-GDR in domains with complex constraints and real-time requirements.

Abstract

Constrained Markov Decision Processes (CMDPs) are critical in many high-stakes applications, where decisions must optimize cumulative rewards while strictly adhering to complex nonlinear constraints. In domains such as power systems, finance, supply chains, and precision robotics, violating these constraints can result in significant financial or societal costs. Existing Reinforcement Learning (RL) methods often struggle with sample efficiency and effectiveness in finding feasible policies for highly and strictly constrained CMDPs, limiting their applicability in these environments. Stochastic dual dynamic programming is often used in practice on convex relaxations of the original problem, but it also encounters computational challenges and loss of optimality. This paper introduces a novel approach, Two-Stage Deep Decision Rules (TS-DDR), to efficiently train parametric actor policies using Lagrangian Duality. TS-DDR is a self-supervised learning algorithm that trains general decision rules (parametric policies) using stochastic gradient descent (SGD); its forward passes solve deterministic optimization problems to find feasible policies, and its backward passes leverage duality theory to train the parametric policy with closed-form gradients. TS-DDR inherits the flexibility and computational performance of deep learning methodologies to solve CMDP problems. Applied to the Long-Term Hydrothermal Dispatch (LTHD) problem using actual power system data from Bolivia, TS-DDR is shown to enhance solution quality and to reduce computation times by several orders of magnitude when compared to current state-of-the-art methods.
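
The forward/backward structure described in the abstract can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: the one-parameter policy, the scalar inner problem, and every name below (`policy`, `solve_inner`, `train`, `theta`, `w`, `d`) are hypothetical, and the inner problem is solved in closed form rather than by an optimization solver. The assumed inner problem is min_x 0.5*(d - x)^2 subject to x = x_hat, with Lagrangian 0.5*(d - x)^2 + lam*(x_hat - x); under that sign convention the dual lam equals the derivative of the optimal cost with respect to the target x_hat, so the policy gradient follows from the chain rule without differentiating through the solve.

```python
import random


def policy(theta, w):
    """Hypothetical one-parameter decision rule: maps the observed
    uncertainty w to a target x_hat for the inner problem."""
    return theta * w


def solve_inner(x_hat, d):
    """Forward pass on a toy deterministic problem (an assumption, not the
    paper's model):  min_x 0.5*(d - x)^2  s.t.  x = x_hat.
    With Lagrangian 0.5*(d - x)^2 + lam*(x_hat - x), stationarity in x
    gives lam = x - d, which equals d(cost)/d(x_hat)."""
    x = x_hat                    # the target constraint fixes x
    cost = 0.5 * (d - x) ** 2    # optimal value of the inner problem
    lam = x - d                  # dual of the target constraint
    return cost, lam


def train(steps=500, lr=0.1, seed=0):
    """SGD over sampled scenarios; the backward pass uses only the dual."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        w = rng.uniform(0.5, 1.5)   # sampled uncertainty
        d = 2.0 * w                 # toy demand, so the best theta is 2
        x_hat = policy(theta, w)
        _, lam = solve_inner(x_hat, d)
        grad = lam * w              # chain rule: lam * d(x_hat)/d(theta)
        theta -= lr * grad
    return theta


if __name__ == "__main__":
    print(train())
```

In this toy setting the learned `theta` approaches 2, the value at which the target matches demand in every scenario. In a realistic setting the closed-form `solve_inner` would be replaced by a call to a solver that returns the duals of the target constraints.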

Paper Structure

This paper contains 28 sections, 22 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: LTHD Training Curves for Model-Free RL Methods and TS-GDR Baseline.
  • Figure 2: Expected Stored Energy for the AC Formulation.
  • Figure 3: Expected Thermal Dispatch for the AC Formulation.
  • Figure 4: Expected stored energy and thermal dispatch over time for the DCLL formulation.
  • Figure 5: Expected stored energy and thermal dispatch over time for the SOC formulation.