MAD: A Magnitude And Direction Policy Parametrization for Stability Constrained Reinforcement Learning

Luca Furieri, Sucheth Shenoy, Danilo Saccani, Andrea Martin, Giancarlo Ferrari-Trecate

TL;DR

This work tackles stability guarantees in reinforcement learning for nonlinear, discrete-time systems by introducing magnitude-and-direction (MAD) policies. MAD decouples stability from expressivity by placing the stable, $\mathcal{L}_p$-bounded magnitude in a fixed operator while learning a flexible state-dependent direction, allowing integration with model-free RL. The authors show that, with model knowledge, MAD can realize all stabilizing controllers and strictly expands the set of achievable closed-loop behaviors beyond disturbance-feedback (DF) policies; under model uncertainty, MAD remains stabilizing under a quantitative robustness condition and retains state-feedback capabilities even when the disturbance reconstruction is unreliable. Empirical results on a corridor navigation task demonstrate that MAD policies generalize as well as standard neural policies, while guaranteeing closed-loop stability by design, highlighting practical benefits for safety-critical RL.

Abstract

We introduce magnitude and direction (MAD) policies, a policy parameterization for reinforcement learning (RL) that preserves Lp closed-loop stability for nonlinear dynamical systems. Despite their completeness in describing all stabilizing controllers, methods based on nonlinear Youla and system-level synthesis are significantly impacted by the difficulty of parametrizing Lp-stable operators. In contrast, MAD policies introduce explicit feedback on state-dependent features - a key element behind the success of reinforcement learning pipelines - without jeopardizing closed-loop stability. This is achieved by letting the magnitude of the control input be described by a disturbance-feedback Lp-stable operator, while selecting its direction based on state-dependent features through a universal function approximator. We further characterize the robust stability properties of MAD policies under model mismatch. Unlike existing disturbance-feedback policy parametrizations, MAD policies introduce state-feedback components compatible with model-free RL pipelines, ensuring closed-loop stability with no model information beyond assuming open-loop stability. Numerical experiments show that MAD policies trained with deep deterministic policy gradient (DDPG) methods generalize to unseen scenarios - matching the performance of standard neural network policies while guaranteeing closed-loop stability by design.
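
The magnitude/direction split described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the linear open-loop stable plant (`A`, `B`), the first-order stable filter used as the magnitude operator, the random-weight `tanh` network standing in for a learned direction, and the elementwise product that combines the two. The point it shows is that, because each direction entry lies in [-1, 1], the input is bounded entrywise by the magnitude signal, so the state-feedback part cannot inject more energy than the stable disturbance-feedback part allows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical open-loop stable plant: x_{t+1} = A x_t + B u_t + w_t
A = np.array([[0.9, 0.1], [0.0, 0.8]])  # eigenvalues 0.9 and 0.8 (stable)
B = np.eye(2)

# Hypothetical direction network; random weights stand in for learned ones.
W1 = rng.normal(size=(8, 2))
W2 = rng.normal(size=(2, 8))


def direction(x):
    """State-feedback direction with every entry in [-1, 1] (tanh output)."""
    return np.tanh(W2 @ np.tanh(W1 @ x))


def simulate(T=200, a=0.7):
    """Roll out the sketch policy u_t = |m_t| * direction(x_t)."""
    x = np.array([1.0, -1.0])  # initial state
    m = np.zeros(2)            # internal state of the magnitude filter
    w_hat = x.copy()           # the initial condition acts as the first disturbance
    xs, us, ms = [], [], []
    for t in range(T):
        # Assumed magnitude operator: stable filter (|a| < 1) driven by
        # the reconstructed disturbance.
        m = a * m + w_hat
        u = np.abs(m) * direction(x)
        w = 0.5 ** t * rng.normal(size=2)  # decaying disturbance
        x_next = A @ x + B @ u + w
        # With the exact model, the disturbance is recovered from the data.
        w_hat = x_next - A @ x - B @ u
        xs.append(x)
        us.append(u)
        ms.append(m)
        x = x_next
    return np.array(xs), np.array(us), np.array(ms)


xs, us, ms = simulate()
print("final state norm:", np.linalg.norm(xs[-1]))
```

Replacing the random weights in `direction` with any other values leaves the bound `|u_t| <= |m_t|` intact, which is the sense in which this sketch keeps stability independent of the learned state-feedback component.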

Paper Structure

This paper contains 14 sections, 3 theorems, 25 equations, 2 figures, 1 table.

Key Result

Theorem 1

(adapted from furieri2024learningtoboost) Let $\mathbfcal{F} \in \mathcal{L}_p$, and define the control input as where $\mathbfcal{M}:\ell^n\rightarrow \ell^m$ is a causal operator. If $\mathbfcal{M} \in \mathcal{L}_p$, then the closed-loop system satisfies eq:closed_loop_stability. Conversely, if a causal policy $\mathbf{u}=\mathbf{K}(\mathbf{x})$ ensures eq:closed_loop_stability, then there exi

Figures (2)

  • Figure 1: Closed-loop trajectories after training with MAD policies. Initial conditions are marked with $\circ$. The colored balls (and their radii) represent the agents (and their size for collision avoidance). Black objects represent the obstacles.
  • Figure 2: Percentage improvement in control performance over the pre-stabilizing controller $\pi_b$ defined in \ref{eq:pre_controller}. Dotted lines represent the performance in each episode, while solid lines indicate the best-so-far performance for each policy class. To reduce visual clutter, episodic performance (dotted lines) is shown only for the MA and MAD policies. The inset plot displays the long-term best-so-far performance between episodes 2500 and 3000.

Theorems & Definitions (7)

  • Theorem 1
  • Definition 1: MAD policies
  • Remark 1
  • Theorem 2
  • proof
  • Proposition 1
  • proof