MAD: A Magnitude And Direction Policy Parametrization for Stability Constrained Reinforcement Learning
Luca Furieri, Sucheth Shenoy, Danilo Saccani, Andrea Martin, Giancarlo Ferrari-Trecate
TL;DR
This work tackles stability guarantees in reinforcement learning for nonlinear, discrete-time systems by introducing magnitude-and-direction (MAD) policies. MAD decouples stability from expressivity by placing the stable, $\text{L}_p$-bounded magnitude in a fixed operator while learning a flexible state-dependent direction, allowing integration with model-free RL. The authors show that, with model knowledge, MAD can realize all stabilizing controllers and strictly expands the set of achievable closed-loop behaviors beyond disturbance-feedback (DF) policies; under model uncertainty, MAD remains stabilizing under a quantitative robustness condition and retains state-feedback capabilities even when the disturbance reconstruction is unreliable. Empirical results on a corridor navigation task demonstrate that MAD policies generalize as well as standard neural policies, while guaranteeing closed-loop stability by design, highlighting practical benefits for safety-critical RL.
Abstract
We introduce magnitude and direction (MAD) policies, a policy parameterization for reinforcement learning (RL) that preserves Lp closed-loop stability for nonlinear dynamical systems. Despite their completeness in describing all stabilizing controllers, methods based on nonlinear Youla and system-level synthesis are significantly impacted by the difficulty of parametrizing Lp-stable operators. In contrast, MAD policies introduce explicit feedback on state-dependent features - a key element behind the success of reinforcement learning pipelines - without jeopardizing closed-loop stability. This is achieved by letting the magnitude of the control input be described by a disturbance-feedback Lp-stable operator, while selecting its direction based on state-dependent features through a universal function approximator. We further characterize the robust stability properties of MAD policies under model mismatch. Unlike existing disturbance-feedback policy parametrizations, MAD policies introduce state-feedback components compatible with model-free RL pipelines, ensuring closed-loop stability with no model information beyond assuming open-loop stability. Numerical experiments show that MAD policies trained with deep deterministic policy gradient (DDPG) methods generalize to unseen scenarios - matching the performance of standard neural network policies while guaranteeing closed-loop stability by design.
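The magnitude-and-direction split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. Everything specific here is an assumption made for illustration: a scalar, open-loop stable linear plant x_{t+1} = a*x_t + u_t + w_t, a first-order stable filter (pole `rho`) standing in for the Lp-stable magnitude operator, a `tanh` of an affine state feature (weights `theta`) standing in for the direction network, and an exact model for reconstructing the disturbance. The function name `rollout_mad` is hypothetical.

```python
import numpy as np


def rollout_mad(theta, a=0.8, rho=0.5, horizon=200, w0=1.0):
    """Simulate a toy MAD-style policy on x_{t+1} = a*x_t + u_t + w_t.

    All names and values are illustrative assumptions:
      theta : weights of the direction term (arbitrary, possibly untrained)
      a     : plant pole, |a| < 1 (open-loop stable)
      rho   : pole of the magnitude filter, |rho| < 1 (stable operator)
      w0    : impulse disturbance applied at t = 0
    """
    x, m = 0.0, 0.0
    xs, us, ms = [], [], []
    for t in range(horizon):
        w = w0 if t == 0 else 0.0

        # Direction: state-dependent and bounded in [-1, 1].
        d = np.tanh(theta[0] * x + theta[1])
        # Control input: bounded direction scaled by the magnitude, so |u_t| <= |m_t|.
        u = abs(m) * d

        xs.append(x)
        us.append(u)
        ms.append(abs(m))

        x_next = a * x + u + w
        # Disturbance reconstruction, assuming the model (a) is known exactly.
        w_hat = x_next - a * x - u
        # Magnitude: stable filter driven only by the reconstructed disturbance.
        m = rho * m + w_hat
        x = x_next
    return np.array(xs), np.array(us), np.array(ms)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = 10.0 * rng.standard_normal(2)  # arbitrary, untrained weights
    xs, us, ms = rollout_mad(theta)
    print("final |x|:", abs(xs[-1]))
    print("max(|u| - |m|):", np.max(np.abs(us) - ms))
```

In this toy setting the input is dominated by the magnitude signal, which decays after the impulse whatever value `theta` takes, so the state of the stable plant decays as well. The direction weights only shape how the state gets there, which is the part a model-free RL algorithm would be free to train.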
