Safe Exploration by Solving Early Terminated MDP

Hao Sun, Ziping Xu, Meng Fang, Zhenghao Peng, Jiadong Guo, Bo Dai, Bolei Zhou

TL;DR

This work addresses safe exploration in reinforcement learning by reframing constrained MDPs as Early Terminated MDPs (ET-MDPs), which terminate episodes upon constraint violations. It proposes an off-policy Context TD3 solver that uses context representations to overcome limited state visitation in ET-MDPs, enabling efficient learning. The key theoretical result shows that for sufficiently small termination reward $r_e$, the ET-MDP's optimal value $V^*_{ET}$ matches the CMDP's $V^*_c$, ensuring safety-preserving policies. Empirically, Context TD3 on ET-MDP achieves higher sample efficiency and better asymptotic performance with lower constraint violations across tight and budgeted CMDP benchmarks.

Abstract

Safe exploration is crucial for the real-world application of reinforcement learning (RL). Previous works formulate the safe exploration problem as a Constrained Markov Decision Process (CMDP), in which policies are optimized under constraints. However, when encountering potential danger, humans tend to stop immediately and rarely learn to behave safely while in danger. Motivated by human learning, we introduce a new approach to safe RL problems under the framework of the Early Terminated MDP (ET-MDP). We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP. We then propose an off-policy algorithm based on context models to solve the ET-MDP, which thereby solves the corresponding CMDP with better asymptotic performance and improved learning efficiency. Experiments on various CMDP tasks show a substantial improvement over previous methods that solve the CMDP directly.
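The early-termination idea can be illustrated with a small environment wrapper. The following is a minimal sketch, not the authors' implementation: `LineWorld`, `EarlyTermination`, the `cost_budget` parameter, and the choice to replace the violating step's reward with `r_e` are all assumptions made for illustration.

```python
class LineWorld:
    """Hypothetical 1-D toy task: positions 0..4, where position 4 is 'lava'."""

    def reset(self):
        self.pos = 2
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); the agent is clamped to [0, 4]
        self.pos = max(0, min(4, self.pos + action))
        reward = 1.0 if action == 1 else 0.0   # moving right is rewarded
        cost = 1.0 if self.pos == 4 else 0.0   # standing in lava incurs cost
        return self.pos, reward, cost, False   # the base task never ends by itself


class EarlyTermination:
    """Turns a cost-emitting environment into an unconstrained, early-terminated one.

    Assumption: once the accumulated cost exceeds `cost_budget`, the episode
    ends and the step's reward is replaced by the termination reward `r_e`.
    """

    def __init__(self, env, cost_budget=0.0, r_e=-1.0):
        self.env = env
        self.cost_budget = cost_budget
        self.r_e = r_e
        self.cum_cost = 0.0

    def reset(self):
        self.cum_cost = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, cost, done = self.env.step(action)
        self.cum_cost += cost
        if self.cum_cost > self.cost_budget:
            # Constraint violated: terminate immediately with reward r_e.
            return obs, self.r_e, True
        return obs, reward, done


env = EarlyTermination(LineWorld(), cost_budget=0.0, r_e=-1.0)
env.reset()
first = env.step(+1)    # moves to position 3, still safe
second = env.step(+1)   # enters lava, so the episode is cut short
```

With `cost_budget=0.0` the wrapper mimics a tight constraint (any violation ends the episode); a positive budget tolerates a limited amount of accumulated cost before terminating. The agent interacting with the wrapped environment never sees the cost signal, only rewards and terminations.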

Paper Structure

This paper contains 43 sections, 3 theorems, 15 equations, 13 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

For sufficiently small $r_e$, the optimal policy of the ET-MDP coincides with $\pi^*$ of the original CMDP. (The proof is given in the Appendix.)
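A toy calculation can make the role of "sufficiently small" concrete. The sketch below is an illustrative assumption, not taken from the paper: a hypothetical four-state corridor in which a `"short"` action reaches the goal through a cost region, while the `"long"` path is cost-free. Under a zero cost budget, the CMDP-optimal policy is the long path. The code runs tabular value iteration on the early-terminated version, assuming a violating step yields reward `r_e` and ends the episode.

```python
GAMMA = 0.9
GOAL = 3  # terminal state with value 0

# Hypothetical deterministic dynamics:
# TRANSITIONS[state][action] = (next_state, reward, cost)
TRANSITIONS = {
    0: {"long": (1, 0.0, 0.0), "short": (GOAL, 10.0, 1.0)},  # shortcut crosses lava
    1: {"long": (2, 0.0, 0.0)},
    2: {"long": (GOAL, 10.0, 0.0)},
}


def solve_et_mdp(r_e, sweeps=50):
    """Value iteration on the early-terminated MDP; returns (policy, values)."""
    values = {state: 0.0 for state in TRANSITIONS}
    values[GOAL] = 0.0

    def q_value(state, action):
        next_state, reward, cost = TRANSITIONS[state][action]
        if cost > 0:
            # Assumed ET-MDP rule: a violation ends the episode with reward r_e.
            return r_e
        return reward + GAMMA * values[next_state]

    for _ in range(sweeps):
        for state in TRANSITIONS:
            values[state] = max(q_value(state, a) for a in TRANSITIONS[state])

    policy = {
        state: max(TRANSITIONS[state], key=lambda a: q_value(state, a))
        for state in TRANSITIONS
    }
    return policy, values


safe_policy, safe_values = solve_et_mdp(r_e=-1.0)   # small termination reward
unsafe_policy, _ = solve_et_mdp(r_e=9.0)            # termination reward too large
```

In this toy problem the long path is worth $\gamma^2 \cdot 10 = 8.1$ from the start state, so the early-terminated optimum matches the safe policy whenever $r_e < 8.1$ and switches to the unsafe shortcut once $r_e$ exceeds that value. The threshold is specific to this example.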

Figures (13)

  • Figure 1: The difference in state visitation frequency between MDPs and ET-MDPs in a diagnostic 2-D navigation environment. Left: the environment, in which an agent starts from the central red point in each episode and yellow lines denote lava, i.e., the danger zone; Middle: the state visitation frequency of a random agent in an MDP that ignores the lava; Right: the state visitation frequency of a random agent with lava in an ET-MDP. The limited state visitation in the ET-MDP is the major challenge for existing RL algorithms.
  • Figure 2: Examples of the tested environments: The first three figures show the diagnostic 2D-Nav tasks with different constraint levels; the following three figures show the budget tasks, where agents control a point or a car to collect reward without hitting cost regions too many times; the last three figures show loose-constrained tasks, where agents need to learn to move forward without falling.
  • Figure 3: Results on the three budget tasks. The first three columns show the rewards and the costs of different methods on the three environments, respectively, while the last column compares learning with the extended state space against learning with the tightened approximation (see the paper's section on the tightened approximation).
  • Figure 4: Experiment Results on the diagnostic 2D navigation environment.
  • Figure 5: The first three figures show learning curves of TD3 and Context TD3 with and without the early termination trick in three MuJoCo locomotion tasks; the last figure shows that the context model can remarkably improve learning efficiency when state visitation is limited.
  • ...and 8 more figures

Theorems & Definitions (5)

  • Proposition 1
  • Theorem 1: Theorem 3 in wen2013efficient
  • Corollary 1
  • Remark 1: ET-MDPs reduce sample complexity
  • proof