Towards Interpretable Reinforcement Learning with Constrained Normalizing Flow Policies
Finn Rietz, Erik Schaffernicht, Stefan Heinrich, Johannes A. Stork
TL;DR
This work tackles safety and interpretability in reinforcement learning by introducing constrained normalizing flow policies (CNFP) that enforce instantaneous state-action constraints during training. CNFP achieves this by analytically constructing invertible constraint mappings that transform an unbounded action distribution into the per-state allowed region $\mathcal{A}_\varphi^\mathbf{s}$, integrating with Soft Actor-Critic through tractable log-densities derived from the change-of-variables formula. The approach yields a modular, interpretable policy where each flow step aligns with a specific constraint, enabling constraint satisfaction by construction and faster learning on a dense reward. Empirical results in a 2D navigation task show CNFP matches unconstrained performance while maintaining hard constraint satisfaction, outperforming reward-penalty and Lagrangian baselines and highlighting the benefit of interpretable, constraint-aware policy design. Future directions include extending to non-convex constraint sets and developing learnable or differentiable constraint mappings to broaden applicability in safety-critical RL.
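As a rough illustration of the mechanism summarized above, the following sketch shows one invertible constraint mapping with a tractable log-density. It is a minimal sketch under stated assumptions, not the authors' implementation: the allowed region is assumed to be an axis-aligned box per state, the squashing function is assumed to be a scaled sigmoid, and the helper names (`allowed_box`, `interval_flow`, `sample_constrained_action`) and the toy unit-square room are hypothetical.

```python
import numpy as np


def allowed_box(state, max_step=0.1):
    """Hypothetical per-state bounds: displacements that keep a point agent
    inside the unit square [0, 1]^2 (a toy assumption, not the paper's task)."""
    lo = np.maximum(-max_step, -state)
    hi = np.minimum(max_step, 1.0 - state)
    return lo, hi


def gaussian_log_prob(u, mean, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    z = (u - mean) / std
    return np.sum(-0.5 * z**2 - np.log(std) - 0.5 * np.log(2.0 * np.pi), axis=-1)


def interval_flow(u, lo, hi):
    """Map unbounded u into the open box (lo, hi) with a scaled sigmoid.

    Returns the constrained action and log|det J| of the mapping.
    """
    sig = 1.0 / (1.0 + np.exp(-u))
    a = lo + (hi - lo) * sig
    # Diagonal Jacobian: da/du = (hi - lo) * sig * (1 - sig).
    # log(sig) and log(1 - sig) are computed stably via logaddexp.
    log_det = np.sum(
        np.log(hi - lo) - np.logaddexp(0.0, -u) - np.logaddexp(0.0, u), axis=-1
    )
    return a, log_det


def sample_constrained_action(rng, mean, std, state):
    """Sample from the unbounded base policy, then push through the flow step."""
    lo, hi = allowed_box(state)
    u = mean + std * rng.standard_normal(mean.shape)
    a, log_det = interval_flow(u, lo, hi)
    # Change of variables: log pi(a|s) = log p(u|s) - log|det da/du|.
    log_prob = gaussian_log_prob(u, mean, std) - log_det
    return a, log_prob


rng = np.random.default_rng(0)
state = np.array([0.05, 0.5])  # close to the left wall of the toy room
action, log_prob = sample_constrained_action(rng, np.zeros(2), np.ones(2), state)
```

Because every sample passes through `interval_flow`, the resulting action lies inside the per-state box by construction, and the returned `log_prob` is the kind of quantity an entropy-regularized algorithm such as Soft Actor-Critic needs.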
Abstract
Reinforcement learning policies are typically represented by black-box neural networks, which are non-interpretable and not well-suited for safety-critical domains. To address both of these issues, we propose constrained normalizing flow policies as interpretable and safe-by-construction policy models. We achieve safety for reinforcement learning problems with instantaneous safety constraints, for which we can exploit domain knowledge by analytically constructing a normalizing flow that ensures constraint satisfaction. The normalizing flow corresponds to an interpretable sequence of transformations on action samples, each ensuring alignment with respect to a particular constraint. Our experiments reveal benefits beyond interpretability: an easier learning objective and constraint satisfaction that is maintained throughout the entire learning process. Our approach favors constraints over reward engineering, offering enhanced interpretability, safety, and a direct means of providing domain knowledge to the agent without relying on complex reward functions.
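To make the sequence-of-transformations view concrete, the standard change-of-variables identity for a composed flow can be written as follows. The symbols $\mathbf{u}$ (base sample), $f_k$ (the $k$-th constraint mapping, which may depend on the state $\mathbf{s}$), $\mathbf{z}_k$ (intermediate sample), and $p_0$ (base density) are notation assumed here for illustration and need not match the paper's.

$$
\mathbf{a} = (f_K \circ \cdots \circ f_1)(\mathbf{u}), \qquad \mathbf{z}_0 = \mathbf{u}, \quad \mathbf{z}_k = f_k(\mathbf{z}_{k-1}),
$$

$$
\log \pi(\mathbf{a} \mid \mathbf{s}) = \log p_0(\mathbf{u} \mid \mathbf{s}) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k(\mathbf{z}_{k-1})}{\partial \mathbf{z}_{k-1}} \right|
$$

Each term in the sum belongs to exactly one transformation, so under this reading the contribution of each constraint to the policy density can be inspected separately.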
