Towards Interpretable Reinforcement Learning with Constrained Normalizing Flow Policies

Finn Rietz, Erik Schaffernicht, Stefan Heinrich, Johannes A. Stork

TL;DR

This work tackles safety and interpretability in reinforcement learning by introducing constrained normalizing flow policies (CNFP) that enforce instantaneous state-action constraints during training. CNFP achieves this by analytically constructing invertible constraint mappings that transform an unbounded action distribution into the per-state allowed region $\mathcal{A}_\varphi^\mathbf{s}$, integrating with Soft Actor-Critic through tractable log-densities derived from the change-of-variables formula. The approach yields a modular, interpretable policy where each flow step aligns with a specific constraint, enabling constraint satisfaction by construction and faster learning on a dense reward. Empirical results in a 2D navigation task show CNFP matches unconstrained performance while maintaining hard constraint satisfaction, outperforming reward-penalty and Lagrangian baselines and highlighting the benefit of interpretable, constraint-aware policy design. Future directions include extending to non-convex constraint sets and developing learnable or differentiable constraint mappings to broaden applicability in safety-critical RL.
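For reference, the change-of-variables formula mentioned above can be written out as follows. This is a sketch in notation assumed here rather than taken from the paper: $\mathbf{u}$ is a sample from the unbounded base distribution $p_0$, $f_1, \dots, f_K$ are the invertible constraint mappings for state $\mathbf{s}$, and $\mathbf{z}_k$ are the intermediate samples.

$$
\mathbf{z}_0 = \mathbf{u}, \qquad \mathbf{z}_k = f_k(\mathbf{z}_{k-1}; \mathbf{s}), \qquad \mathbf{a} = \mathbf{z}_K \in \mathcal{A}_\varphi^\mathbf{s},
$$

$$
\log \pi(\mathbf{a} \mid \mathbf{s}) = \log p_0(\mathbf{u} \mid \mathbf{s}) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k(\mathbf{z}_{k-1}; \mathbf{s})}{\partial \mathbf{z}_{k-1}} \right| .
$$

Because every $f_k$ is invertible with a tractable Jacobian, the log-density of the constrained action is available in closed form, which is the quantity the entropy term in Soft Actor-Critic requires.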

Abstract

Reinforcement learning policies are typically represented by black-box neural networks, which are non-interpretable and not well-suited for safety-critical domains. To address both of these issues, we propose constrained normalizing flow policies as interpretable and safe-by-construction policy models. We achieve safety for reinforcement learning problems with instantaneous safety constraints, for which we can exploit domain knowledge by analytically constructing a normalizing flow that ensures constraint satisfaction. The normalizing flow corresponds to an interpretable sequence of transformations on action samples, each ensuring alignment with respect to a particular constraint. Our experiments reveal benefits beyond interpretability in an easier learning objective and maintained constraint satisfaction throughout the entire learning process. Our approach leverages constraints over reward engineering while offering enhanced interpretability, safety, and direct means of providing domain knowledge to the agent without relying on complex reward functions.

Paper Structure (11 sections, 8 equations, 4 figures)

This paper contains 11 sections, 8 equations, 4 figures.

Figures (4)

  • Figure 1: Our interpretable normalizing flow policy. Left: Environment, the agent should reach the star while avoiding dangerous obstacles and walls. Middle: A single flow step maps the initially unbounded policy distribution into the region satisfying the constraint (magenta rectangle), action samples are plotted in red. Right: The final policy distribution has support only over the allowed region.
  • Figure 2: Invertible mapping functions. Exemplary constraint regions are drawn in orange. These functions map from the unbounded domain into the constraint region. The inverse function maps from the constraint region back to the unbounded domain.
  • Figure 3: Two normalizing flows. Top: Permissive constraints, no obstacles are in close proximity and the battery is fully charged. Bottom: Restrictive constraints, the agent is close to an obstacle and its battery is almost empty. The green rectangle indicates a charging station.
  • Figure 4: Baseline comparison in a constrained 2D point navigation environment. Left: Our agent (CNFP) learns the task as quickly as the unconstrained agent since it optimizes the same, smooth and dense reward function while benefiting from a reduced search space. Right: Unlike other baselines, our agent maintains quasi-perfect constraint satisfaction throughout learning. The experiment was repeated three times with varying random seeds, the shaded area corresponds to one standard deviation around the mean.
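
The single flow step described in Figures 1 and 2 can be illustrated with a minimal NumPy sketch. It assumes an axis-aligned box constraint and a scaled `tanh` as the invertible mapping. Both are choices made here for illustration and not necessarily the mapping functions used in the paper, and all function names (`box_forward`, `box_inverse`, `sample_constrained`) are hypothetical.

```python
import numpy as np

def box_forward(u, lo, hi):
    """Map unbounded samples u into the open box (lo, hi).

    Returns the constrained actions and log|det J| of the mapping.
    Assumption: a scaled tanh is used purely for illustration.
    """
    t = np.tanh(u)
    a = lo + 0.5 * (hi - lo) * (t + 1.0)
    # The Jacobian is diagonal: da_i/du_i = 0.5 * (hi_i - lo_i) * (1 - tanh(u_i)^2)
    log_det = np.sum(np.log(0.5 * (hi - lo)) + np.log1p(-t**2), axis=-1)
    return a, log_det

def box_inverse(a, lo, hi):
    """Map actions inside the box back to the unbounded domain."""
    return np.arctanh(2.0 * (a - lo) / (hi - lo) - 1.0)

def gaussian_log_prob(u, mean, std):
    """Log-density of a diagonal Gaussian base distribution."""
    z = (u - mean) / std
    return np.sum(-0.5 * z**2 - np.log(std) - 0.5 * np.log(2.0 * np.pi), axis=-1)

def sample_constrained(rng, mean, std, lo, hi, n):
    """Draw n actions inside the box together with their log-densities."""
    u = mean + std * rng.standard_normal((n, mean.shape[0]))
    a, log_det = box_forward(u, lo, hi)
    # Change of variables: log p(a) = log p(u) - log|det da/du|
    log_prob = gaussian_log_prob(u, mean, std) - log_det
    return a, log_prob

# Usage: a 2D action box, standing in for the magenta rectangle of Figure 1.
rng = np.random.default_rng(0)
lo = np.array([-0.2, 0.0])
hi = np.array([0.5, 1.0])
actions, log_probs = sample_constrained(rng, np.zeros(2), np.ones(2), lo, hi, 1000)
```

Every sampled action lies inside the box regardless of the base distribution's parameters, so constraint satisfaction does not depend on what the policy has learned so far. Stacking several such steps, one per constraint, would add one log-determinant term per step to the density.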