Constraint-Generation Policy Optimization (CGPO): Nonlinear Programming for Policy Optimization in Mixed Discrete-Continuous MDPs

Michael Gimelfarb, Ayal Taitler, Scott Sanner

TL;DR

Constraint-Generation Policy Optimization (CGPO) casts policy optimization for mixed discrete-continuous MDPs (DC-MDPs) as a bilevel mixed-integer nonlinear program over compact, interpretable policy classes. It alternates between an inner problem that adversarially computes worst-case trajectories for the current policy and an outer problem that updates the policy parameters against the constraints generated so far. The result is a policy with a bounded worst-case error and, when CGPO terminates, an optimal policy within the chosen class. The method extends to stochastic DC-MDPs via chance constraints, which yield high-probability performance guarantees. The authors provide a roadmap relating the expressivity of policy, reward, and transition dynamics to computational complexity, demonstrate CGPO on inventory, reservoir, VTOL, and interception domains, and emphasize the interpretability and worst-case diagnostics of the resulting policies, while acknowledging the computational demands of larger problems.
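
One way to read the bilevel structure is as a min-max problem over a regret-style error. The display below is a sketch under assumptions, not the paper's exact statement: it assumes a deterministic problem (no noise term), writes $\mathbf{a}_{1:T}$ for an arbitrary alternative action sequence, and assumes the error is the gap between the return of that sequence and the return of the policy with parameters $\mathbf{w}$, both started from $\mathbf{s}_1$.

$$
\min_{\mathbf{w} \in \mathcal{W}} \; \max_{\mathbf{s}_1 \in \mathcal{S}_1,\; \mathbf{a}_{1:T} \in \mathcal{A}^T} \; \big[ V(\mathbf{a}_{1:T}, \mathbf{s}_1) - V(\mathbf{w}, \mathbf{s}_1) \big]
$$

Under the same assumptions, constraint generation replaces the inner maximization with a finite set $\mathcal{C}_t$ of scenarios found so far, which gives a relaxed outer problem whose optimal value is a lower bound on the min-max value:

$$
\min_{\mathbf{w} \in \mathcal{W},\; \varepsilon} \; \varepsilon \quad \text{s.t.} \quad \varepsilon \ge V(\mathbf{a}_{1:T}, \mathbf{s}_1) - V(\mathbf{w}, \mathbf{s}_1) \quad \forall \, (\mathbf{s}_1, \mathbf{a}_{1:T}) \in \mathcal{C}_t
$$

The inner problem then searches for the scenario that most violates the current $\varepsilon$ and adds it to $\mathcal{C}_t$. If no violating scenario exists, the relaxed solution is also optimal for the full problem.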

Abstract

We propose the Constraint-Generation Policy Optimization (CGPO) framework to optimize policy parameters within compact and interpretable policy classes for mixed discrete-continuous Markov Decision Processes (DC-MDPs). CGPO can not only provide bounded policy error guarantees over an infinite range of initial states for many DC-MDPs with expressive nonlinear dynamics, but it can also provably derive optimal policies in cases where it terminates with zero error. Furthermore, CGPO can generate worst-case state trajectories to diagnose policy deficiencies and provide counterfactual explanations of optimal actions. To achieve such results, CGPO proposes a bilevel mixed-integer nonlinear optimization framework for optimizing policies in defined expressivity classes (e.g., piecewise linear) and reduces it to an optimal constraint generation methodology that adversarially generates worst-case state trajectories. Furthermore, leveraging modern nonlinear optimizers, CGPO can obtain solutions with bounded optimality gap guarantees. We handle stochastic transitions through chance constraints, providing high-probability performance guarantees. We also present a roadmap for understanding the computational complexities of different expressivity classes of policy, reward, and transition dynamics. We experimentally demonstrate the applicability of CGPO across various domains, including inventory control, management of a water reservoir system, and physics control. In summary, CGPO provides structured, compact and explainable policies with bounded performance guarantees, enabling worst-case scenario generation and counterfactual policy diagnostics.
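
The constraint-generation loop described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a one-step deterministic problem, a hypothetical quadratic reward, a hypothetical linear policy class, and a regret-style error, and it replaces the mixed-integer nonlinear solvers with brute-force enumeration over small grids. All names (`reward`, `policy`, `worst_case`, `best_policy`, `cgpo_sketch`) are made up for this illustration.

```python
import itertools

# Finite grids stand in for the continuous sets a real solver would search.
STATES = [i / 10 for i in range(11)]
ACTIONS = [i / 10 for i in range(11)]
WEIGHTS = list(itertools.product([i / 4 for i in range(5)], repeat=2))


def reward(s, a):
    # Hypothetical one-step reward: the best action equals the state.
    return -(a - s) ** 2


def policy(w, s):
    # Hypothetical linear policy class: a = w0 + w1 * s.
    return w[0] + w[1] * s


def regret(w, s, a):
    # Assumed error measure: return of alternative action minus policy return.
    return reward(s, a) - reward(s, policy(w, s))


def worst_case(w):
    # Inner problem: adversary picks the state and alternative action
    # that maximize the regret of the current policy.
    return max((regret(w, s, a), s, a) for s in STATES for a in ACTIONS)


def best_policy(scenarios):
    # Outer problem: minimize the largest regret over generated scenarios only.
    return min(
        (max((regret(w, s, a) for s, a in scenarios), default=float("-inf")), w)
        for w in WEIGHTS
    )


def cgpo_sketch(tol=1e-9, max_iters=100):
    scenarios = []
    for iteration in range(max_iters):
        lower, w = best_policy(scenarios)  # relaxed problem: lower bound
        upper, s, a = worst_case(w)        # true worst case: upper bound
        if upper <= lower + tol:           # no violated constraint remains
            return w, upper, iteration
        scenarios.append((s, a))           # add the violated scenario
    return w, upper, max_iters


w_star, error, iterations = cgpo_sketch()
print(w_star, error, iterations)
```

On these grids the loop stops once the adversary can no longer find a scenario that beats the outer problem's bound, at which point the returned weights are optimal within the (gridded) policy class. With real continuous sets, each enumeration would be replaced by a call to an optimizer, and the gap between `upper` and `lower` would serve as the error bound before termination.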

Paper Structure (45 sections, 1 theorem, 28 equations, 14 figures, 2 tables, 1 algorithm)

Key Result

Theorem 1

If $\mathcal{S}_1$, $\mathcal{A}$, $\Xi_p$ and $\mathcal{W}$ are non-empty compact subsets of Euclidean space, $V(\alpha, \mathbf{s}_{}, \xi_{1:T})$ and $V(\mathbf{w}, \mathbf{s}_{}, \xi_{1:T})$ are continuous, and CGPO terminates at iteration $t$, then $\mathbf{w}_t^*$ is optimal for problem (eqn:b

Figures (14)

  • Figure 1: Reservoir control is used as an illustrative example. Left: an overview of CGPO, which consists of a domain description and policy representation compiled to a bilevel mixed-integer program (MIP), in which the inner problem computes the worst-case trajectories for the current policy while the outer problem updates the policy via constraint-generation. The result is a worst-case scenario for the policy (facilitating policy failure analysis), a concrete policy within the expressivity class (for direct policy inspection), and a gap on its performance (error bound). Right: three optimal (i.e. zero-gap) policies produced upon termination across several piecewise policy classes. Crucially, our framework provides the ability to derive highly compact (e.g. memory and time-efficient to execute), intuitive and nonlinear policies, with strong bound guarantees on policy performance.
  • Figure 2: Nonlinear VTOL system.
  • Figure 3: Reservoir (left), Inventory (middle), VTOL (right): Simulated return (top row) and worst-case error $\varepsilon^*$ (bottom row) over 100 roll-outs as a function of the number of iterations of constraint generation. Bars represent 95% confidence intervals calculated using 10 independent runs of CGPO.
  • Figure 4: Reservoir (left), Inventory (middle), VTOL (right): Examples of optimal policies computed at the end of CGPO.
  • Figure 5: Inventory: Examples of C, S, PWS1-C and PWS2-C (left to right) policies computed by CGPO.
  • ...and 9 more figures
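
To make the idea of a compact piecewise policy concrete, here is a minimal sketch of a one-breakpoint piecewise linear policy on a scalar state. The class name, its fields, and the numbers in the usage example are hypothetical and are not taken from the paper's figures. In CGPO the threshold and the per-piece coefficients would be the policy parameters being optimized.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PiecewiseLinearPolicy:
    """Hypothetical one-breakpoint policy on a scalar state (illustration only)."""

    threshold: float
    low: Tuple[float, float]   # (intercept, slope) used when state <= threshold
    high: Tuple[float, float]  # (intercept, slope) used when state > threshold

    def act(self, state: float) -> float:
        # Select the active piece, then evaluate its linear function.
        intercept, slope = self.low if state <= self.threshold else self.high
        return intercept + slope * state


# Made-up reservoir-style rule: release nothing up to level 50,
# release the excess above 50 otherwise.
release_policy = PiecewiseLinearPolicy(threshold=50.0, low=(0.0, 0.0), high=(-50.0, 1.0))
print(release_policy.act(30.0), release_policy.act(80.0))
```

A policy of this shape is fully described by five numbers, which is what makes it cheap to execute and easy to inspect by hand.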

Theorems & Definitions (2)

  • Theorem 1
  • proof