Probabilistic Constrained Reinforcement Learning with Formal Interpretability

Yanran Wang; Qiuchen Qian; David Boyle

TL;DR

AWaVO reframes constrained RL as Wasserstein variational optimization by introducing Adaptive Generalized Sliced Wasserstein Distance (A-GSWD) and Optimality-Rectified Policy Optimization with Distributional Representation (ORPO-DR). It proves global convergence with rate $\Theta(1/\sqrt{T})$ under mild assumptions and demonstrates guaranteed interpretability via a formal pseudo/true metric and distributional policy representations. Empirically, AWaVO achieves competitive or superior performance-interpretability trade-offs on Acrobot, Cartpole, Walker, Drone, and real quadrotor tasks, while providing probabilistic interpretations of decisions through latent factor analysis. This work advances safe, explainable RL for safety-critical domains by integrating probabilistic inference, distributional representations, and formal interpretability.

Abstract

Reinforcement learning can provide effective reasoning for sequential decision-making problems with variable dynamics. In practice, however, interpreting the reward function and the corresponding optimal policy remains a persistent challenge. Representing sequential decision-making problems as probabilistic inference is therefore valuable: in principle, inference offers diverse and powerful mathematical tools for inferring the stochastic dynamics while suggesting a probabilistic interpretation of policy optimization. In this study, we propose a novel Adaptive Wasserstein Variational Optimization, namely AWaVO, to tackle these interpretability challenges. Our approach uses formal methods to achieve interpretability in three respects: convergence guarantees, training transparency, and intrinsic decision interpretation. To demonstrate its practicality, we showcase guaranteed interpretability with an optimal global convergence rate in simulation and in practical quadrotor tasks. In comparison with state-of-the-art benchmarks including TRPO-IPO, PCPO and CRPO, we empirically verify that AWaVO offers a reasonable trade-off between high performance and sufficient interpretability.

Paper Structure

This paper contains 28 sections, 7 theorems, 31 equations, 9 figures, 3 tables, 2 algorithms.

Introduction
Related Work
Problem Formulation and Preliminaries
Method: Adaptive Sliced Wasserstein Variational Optimization (AWaVO)
WVI: Wasserstein Variational Inference
ORPO-DR: Optimality-Rectified Policy Optimization using Distributional Representation
Formal Methods for Interpretability
Experiments
Limitation
Conclusion
Notation Table
Background on Wasserstein Distance
Sliced Wasserstein Distance
Generalized Sliced Wasserstein Distance
Background on Distributional Representation in Bellman Equation and Temporal Difference Learning
...and 13 more sections

Key Result

Proposition 5.1

(Pseudo-metric): Given two probability measures $\mu,\nu\in P_k(\mathcal{X})$ and a mapping $\alpha: \mathcal{X}\rightarrow\mathcal{R}_{\widetilde{\theta}}$, the adaptive slicing A-GSWD, defined in \ref{a_gswd_def}, with order $k$ in the range $[1,\infty)$, is a pseudo-metric that satisfies non-negativity
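
As background for the proposition above, the following is a minimal sketch of the standard (linear-slice) sliced Wasserstein distance between two equal-size empirical samples, estimated with random projections. It is not the paper's A-GSWD: the adaptive mapping $\alpha$ is not reproduced here, and the function name `sliced_wasserstein`, its parameters, and the Gaussian samples are assumptions made for illustration only. The final lines check numerically three properties commonly required of a pseudo-metric: non-negativity, symmetry, and zero distance from a sample to itself.

```python
import numpy as np


def sliced_wasserstein(x, y, k=2, n_slices=200, seed=0):
    """Monte Carlo estimate of the order-k sliced Wasserstein distance.

    x, y: arrays of shape (n, d) holding two empirical samples of equal size.
    Hypothetical helper for illustration; uses fixed linear slices, not the
    adaptive slicing described in the paper.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Draw random directions and normalise them onto the unit sphere.
    theta = rng.normal(size=(n_slices, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both samples onto every direction, then sort each projection:
    # in one dimension the optimal coupling pairs up order statistics.
    px = np.sort(x @ theta.T, axis=0)
    py = np.sort(y @ theta.T, axis=0)
    # Average |difference|^k over points and slices, then take the k-th root.
    return float(np.mean(np.abs(px - py) ** k) ** (1.0 / k))


# Illustrative samples (assumed, not from the paper).
rng = np.random.default_rng(1)
x = rng.normal(size=(256, 3))
y = rng.normal(loc=1.0, size=(256, 3))

d_xy = sliced_wasserstein(x, y)
d_yx = sliced_wasserstein(y, x)
d_xx = sliced_wasserstein(x, x)
print(d_xy >= 0.0, abs(d_xy - d_yx) < 1e-12, d_xx == 0.0)
```

Using the same seed for both calls fixes the set of projection directions, which is why the symmetry check can be made to near machine precision; with independent directions per call the two estimates would agree only approximately.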

Figures (9)

Figure 1: A new graphical model for constrained RL: refer to Algorithm \ref{AWaVO} for a comprehensive overview of (i) Parameter Identification, (ii) Policy Updating and (iii) Inference Execution.
Figure 2: The algorithmic framework of AWaVO. We reformulate constrained RL as a Wasserstein variational optimization setup, consisting of two primary submodules: ORPO-DR and WVI (Section \ref{alg_sum}).
Figure 3: Performance comparison over 10 seeds. CRPO and AWaVO outperform PaETS, with a trade-off highlighted: although PaETS offers probabilistic interpretation with Bayesian networks, its convergence is generally unstable. Our proposed AWaVO achieves a better balance between high performance and interpretability. Comparing the two other constrained RL algorithms, TRPO-IPO and PCPO, we observe an interesting result: PCPO performs better in tasks like Acrobot, Cartpole, and Walker, while TRPO-IPO outperforms PCPO in the more complex drone tasks (Figure \ref{learning_cur_drone}). Further, in \ref{real_flight_tasks_conv}, we explore more complex real-world tasks using an aerial robot.
Figure 4: We use AWaVO as the tracking controller for a quadrotor, where ORPO-DR is employed as the uncertainty estimator, and WVI using A-GSWD is leveraged as the controller.
Figure 5: Performance comparison on a real quadrotor: AWaVO slightly outperforms the constrained RL approach PCPO, whilst achieving interpretability in \ref{Pro_interpret}.
...and 4 more figures

Theorems & Definitions (9)

Definition 4.1
Proposition 5.1
Remark 5.2
Theorem 5.4
Theorem 5.5
Proposition 4.1
Proposition 4.2
Lemma 4.3
Lemma 4.4
