Probabilistic Constrained Reinforcement Learning with Formal Interpretability

Yanran Wang, Qiuchen Qian, David Boyle

TL;DR

AWaVO reframes constrained RL as Wasserstein variational optimization by introducing Adaptive Generalized Sliced Wasserstein Distance (A-GSWD) and Optimality-Rectified Policy Optimization with Distributional Representation (ORPO-DR). It proves global convergence with rate $\Theta(1/\sqrt{T})$ under mild assumptions and demonstrates guaranteed interpretability via a formal pseudo/true metric and distributional policy representations. Empirically, AWaVO achieves competitive or superior performance-interpretability trade-offs on Acrobot, Cartpole, Walker, Drone, and real quadrotor tasks, while providing probabilistic interpretations of decisions through latent factor analysis. This work advances safe, explainable RL for safety-critical domains by integrating probabilistic inference, distributional representations, and formal interpretability.
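
As a rough reading of the stated rate, assume it bounds an optimality gap after $T$ iterations. The constant $C$, the target accuracy $\varepsilon$, and the name $\text{gap}(T)$ are illustrative and not taken from the paper:

$$
\text{gap}(T) \le \frac{C}{\sqrt{T}}, \qquad \frac{C}{\sqrt{T}} \le \varepsilon \iff T \ge \frac{C^2}{\varepsilon^2}
$$

Under that assumption, halving the target accuracy $\varepsilon$ costs roughly four times as many iterations.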

Abstract

Reinforcement learning can provide effective reasoning for sequential decision-making problems with variable dynamics. In practical implementations, however, interpreting the reward function and the corresponding optimal policy remains a persistent challenge. Representing sequential decision-making problems as probabilistic inference can therefore have considerable value: in principle, inference offers diverse and powerful mathematical tools to infer the stochastic dynamics whilst suggesting a probabilistic interpretation of policy optimization. In this study, we propose a novel Adaptive Wasserstein Variational Optimization, namely AWaVO, to tackle these interpretability challenges. Our approach uses formal methods to achieve interpretability in three respects: convergence guarantees, training transparency, and intrinsic decision interpretation. To demonstrate its practicality, we showcase guaranteed interpretability with an optimal global convergence rate in simulation and in practical quadrotor tasks. In comparison with state-of-the-art benchmarks including TRPO-IPO, PCPO and CRPO, we empirically verify that AWaVO offers a reasonable trade-off between high performance and sufficient interpretability.

Paper Structure

This paper contains 28 sections, 7 theorems, 31 equations, 9 figures, 3 tables, 2 algorithms.

Key Result

Proposition 5.1

(Pseudo-metric): Given two probability measures $\mu,\nu\in P_k(\mathcal{X})$ and a mapping $\alpha: \mathcal{X}\rightarrow\mathcal{R}_{\widetilde{\theta}}$, the adaptive slicing A-GSWD of order $k\in[1,\infty)$, as defined in the paper (reference label: a_gswd_def), is a pseudo-metric: it satisfies non-negativity, symmetry, and the triangle inequality, and the distance from any measure to itself is zero.
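
To make the pseudo-metric properties concrete, the sketch below numerically checks them for the plain sliced Wasserstein distance with random linear projections. This is a simpler stand-in and not the paper's A-GSWD: the adaptive mapping $\alpha$ is not implemented, and the function name `sliced_wasserstein` and its parameters are hypothetical. It assumes both measures are given as equal-size sample sets.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=200, k=2, seed=0):
    """Monte Carlo estimate of the order-k sliced Wasserstein distance
    between two equal-size sample sets x and y of shape (n, d).

    Assumption: plain linear slices on random unit directions, used here
    as a stand-in for the adaptive slicing of A-GSWD.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Draw random directions and normalise them onto the unit sphere.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project the samples, then sort each 1-D projection: for equal-size
    # samples, 1-D optimal transport pairs the sorted points.
    px = np.sort(x @ theta.T, axis=0)
    py = np.sort(y @ theta.T, axis=0)
    # Average |difference|^k over points and directions, then take the k-th root.
    return float(np.mean(np.abs(px - py) ** k) ** (1.0 / k))

rng = np.random.default_rng(1)
x = rng.normal(size=(64, 3))
y = rng.normal(loc=1.0, size=(64, 3))
z = rng.uniform(size=(64, 3))

d_xy = sliced_wasserstein(x, y)
d_yx = sliced_wasserstein(y, x)
d_xz = sliced_wasserstein(x, z)
d_yz = sliced_wasserstein(y, z)

print("non-negativity:", d_xy >= 0.0)
print("symmetry:", abs(d_xy - d_yx) < 1e-12)
print("identity d(x, x) = 0:", sliced_wasserstein(x, x) == 0.0)
print("triangle inequality:", d_xz <= d_xy + d_yz + 1e-9)
```

Because the seed fixes the projection directions, every call uses the same slices, so the estimate is an $L^k$ norm of the difference between sorted projections and the triangle inequality holds up to floating-point error.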

Figures (9)

  • Figure 1: A new graphical model for constrained RL: refer to Algorithm \ref{AWaVO} for a comprehensive overview of (i) Parameter Identification, (ii) Policy Updating and (iii) Inference Execution.
  • Figure 2: The algorithmic framework of AWaVO. We reform constrained RL as a Wasserstein variational optimization setup, consisting of two primary submodules: ORPO-DR and WVI (Section \ref{alg_sum}).
  • Figure 3: Performance comparison over 10 seeds. CRPO and AWaVO outperform PaETS, with a trade-off highlighted: although PaETS offers probabilistic interpretation with Bayesian networks, its convergence is generally unstable. Our proposed AWaVO achieves a better balance between high performance and interpretability. In contrast to two other constrained RL algorithms, i.e., TRPO-IPO and PCPO, we observe an interesting result: PCPO performs better in tasks like Acrobot, Cartpole, and Walker, while TRPO-IPO outperforms PCPO in the more complex drone tasks (Figure \ref{learning_cur_drone}). Further, in \ref{real_flight_tasks_conv}, we will explore more complex real-world tasks using an aerial robot.
  • Figure 4: We use our AWaVO as the tracking controller for a quadrotor, where ORPO-DR is employed as the uncertainty estimator, and WVI using A-GSWD is leveraged as the controller.
  • Figure 5: Performance comparison in a real quadrotor: our AWaVO slightly outperforms the constrained RL approach, i.e., PCPO, whilst achieving interpretability in \ref{Pro_interpret}.
  • ...and 4 more figures

Theorems & Definitions (9)

  • Definition 4.1
  • Proposition 5.1
  • Remark 5.2
  • Theorem 5.4
  • Theorem 5.5
  • Proposition 4.1
  • Proposition 4.2
  • Lemma 4.3
  • Lemma 4.4