Excluding the Irrelevant: Focusing Reinforcement Learning through Continuous Action Masking

Roland Stolz, Hanna Krasowski, Jakob Thumm, Michael Eichelbeck, Philipp Gassert, Matthias Althoff

TL;DR

This paper proposes to focus learning on the set of relevant actions and introduces three continuous action masking methods for exactly mapping the action space to the state-dependent set of relevant actions, enhancing the predictability of the RL agent and enabling its use in safety-critical applications.

Abstract

Continuous action spaces in reinforcement learning (RL) are commonly defined as multidimensional intervals. While intervals usually reflect the action boundaries for tasks well, they can be challenging for learning because the typically large global action space leads to frequent exploration of irrelevant actions. Yet, little task knowledge can be sufficient to identify significantly smaller state-specific sets of relevant actions. Focusing learning on these relevant actions can significantly improve training efficiency and effectiveness. In this paper, we propose to focus learning on the set of relevant actions and introduce three continuous action masking methods for exactly mapping the action space to the state-dependent set of relevant actions. Thus, our methods ensure that only relevant actions are executed, enhancing the predictability of the RL agent and enabling its use in safety-critical applications. We further derive the implications of the proposed methods on the policy gradient. Using proximal policy optimization (PPO), we evaluate our methods on four control tasks, where the relevant action set is computed based on the system dynamics and a relevant state set. Our experiments show that the three action masking methods achieve higher final rewards and converge faster than the baseline without action masking.

Paper Structure

This paper contains 31 sections, 6 theorems, 40 equations, 5 figures, and 7 tables.

Key Result

Proposition 1

Policy gradient for the ray mask.

Figures (5)

  • Figure 1: Illustration of masking methods in action space $\mathcal{A}$ with a hexagon-shaped relevant action set $\mathcal{A}^r$. The ray mask radially maps the actions towards the center of the relevant action set. The generator mask employs the latent action space $\mathcal{A}^l$, which is the generator space of the zonotope modeling the relevant action set. The distributional mask augments the policy probability density function so that it is zero outside the relevant action set.
  • Figure 2: The Seeker Reach-Avoid environment with state and action space. The agent (black) has to reach the goal (gray) while avoiding the obstacle (red). The center of the action space is illustrated by a cross and the relevant action set $\mathcal{A}^r$ for the current state is shown in green. The state set reachable at the next time step, by the relevant action set, is $\mathcal{S}_{\Delta t}$.
  • Figure 3: Average reward curves for benchmarks with transparent bootstrapped 95% confidence interval.
  • Figure 4: Average reward curves for Walker2D with transparent bootstrapped 95% confidence interval including standard PPO as additional baseline.
  • Figure 5: Qualitative deployment results for ten initial states and one goal-obstacle configuration for the Seeker environment. The top half shows ten trajectories with randomly sampled starting states. The lower half depicts the relevant action set (green polygon) for each time step along one trajectory.
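
The Figure 1 caption describes the ray mask and the generator mask only in words, so a small numerical sketch may help make the two mappings concrete. The code below is an illustration under stated assumptions, not the authors' implementation. For the ray mask it assumes that both the global action space and the relevant action set are axis-aligned boxes, and that an action is scaled along the ray from the center of the relevant set by the ratio of the two boundary distances; this scaling rule is one plausible reading of "radially maps the actions towards the center", and the function names `boundary_distance` and `ray_mask` are hypothetical. For the generator mask it uses the standard zonotope definition $\{c + G\beta \mid \beta \in [-1, 1]^p\}$, treating $\beta$ as the latent action; `generator_mask` is likewise a hypothetical name.

```python
import numpy as np


def boundary_distance(center, direction, low, high):
    """Distance from `center` to the boundary of the box [low, high]
    along the unit vector `direction` (center must lie inside the box)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        t_high = (high - center) / direction
        t_low = (low - center) / direction
    # Pick the face that the ray actually moves towards in each dimension;
    # dimensions with zero direction never hit a face.
    t = np.where(direction > 0, t_high, np.where(direction < 0, t_low, np.inf))
    return float(np.min(t))


def ray_mask(action, low, high, rel_low, rel_high):
    """Assumed ray mask: shrink `action` along the ray from the center of the
    relevant box so that the global box boundary lands on the relevant one."""
    center = 0.5 * (rel_low + rel_high)
    offset = action - center
    norm = np.linalg.norm(offset)
    if norm == 0.0:
        return center.copy()
    direction = offset / norm
    lam_global = boundary_distance(center, direction, low, high)
    lam_relevant = boundary_distance(center, direction, rel_low, rel_high)
    return center + (lam_relevant / lam_global) * offset


def generator_mask(latent_action, center, generators):
    """Map a latent action in [-1, 1]^p to the zonotope c + G @ beta."""
    beta = np.clip(latent_action, -1.0, 1.0)
    return center + generators @ beta


# Example: global action space [-1, 1]^2, relevant box [0, 0.5]^2.
low, high = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
rel_low, rel_high = np.array([0.0, 0.0]), np.array([0.5, 0.5])
masked = ray_mask(np.array([1.0, 1.0]), low, high, rel_low, rel_high)

# Example: hexagon-shaped zonotope in 2D with three generators (columns of G).
c = np.array([0.1, -0.1])
G = np.array([[0.2, 0.0, 0.1],
              [0.0, 0.2, 0.1]])
relevant_action = generator_mask(np.array([1.0, 1.0, 1.0]), c, G)
```

In this sketch the ray mask sends every boundary point of the global box to a boundary point of the relevant box, so the whole action space is covered without clipping. The generator mask instead lets the policy act directly in the latent space, whose dimension equals the number of generators rather than the dimension of the action space.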

Theorems & Definitions (12)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • proof
  • Lemma 1
  • proof
  • ...and 2 more