Stepping Out of the Shadows: Reinforcement Learning in Shadow Mode

Philipp Gassert, Matthias Althoff

TL;DR

The novel shadow-mode approach to reinforcement learning is demonstrated on a reach-avoid task, where it effectively trains an agent even though standard approaches fail, and improves performance compared to using only conventional controllers or only reinforcement learning.

Abstract

Reinforcement learning (RL) is not yet competitive for many cyber-physical systems, such as robotics, process automation, and power systems, as training on a system with physical components cannot be accelerated, and simulation models do not exist or suffer from a large simulation-to-reality gap. During the long training time, expensive equipment cannot be used and might even be damaged due to inappropriate actions of the reinforcement learning agent. Our novel approach addresses exactly this problem: We train the reinforcement learning agent in a so-called shadow mode with the assistance of an existing conventional controller, which does not have to be trained and performs reasonably well from the start. In shadow mode, the agent relies on the controller to provide action samples and guidance towards favourable states to learn the task, while simultaneously estimating for which states the learned agent will receive a higher reward than the conventional controller. The RL agent then controls the system in these states, and all other regions remain under the control of the existing controller. Over time, the RL agent takes over an increasing number of states, while leaving control to the baseline wherever it cannot surpass the baseline's performance. Thus, we keep regret during training low and improve the performance compared to using only conventional controllers or only reinforcement learning. We present and evaluate two mechanisms for deciding whether to use the RL agent or the conventional controller. The usefulness of our approach is demonstrated for a reach-avoid task, for which we are able to effectively train an agent where standard approaches fail.

Paper Structure

This paper contains 22 sections, 10 equations, 5 figures.

Figures (5)

  • Figure 1: Schematic of training in shadow mode. \ref{fig:framework_flow}: Working principle combining the baseline agent and the RL agent. Both are fed the state and reward information $s_t, r_t$ and choose actions $a^a_t$ according to $\pi^a$ and $a^b_t$ according to $\pi^b$. Additionally, the agent passes its decision on which action to choose, $a_t^{decision}$ (see Sec. \ref{subsubsec:agent_decision}), or the $Q$-values of the two actions, $Q(s_t, a^a_t)$ and $Q(s_t, a^b_t)$, for comparison (see Sec. \ref{subsubsec:max_q_val}). Only $a_t^c$ is executed on the real system. \ref{fig:framework_example}: Exemplary episode with the combined agent in our reach-avoid environment. The agent (blue circle) is initialized at the top left position and has to reach the goal (green circle) while avoiding the obstacle (red bar). For the first four steps, the agent actions were chosen (red arrows), with the alternative baseline actions shown as dashed blue arrows. For the remaining four steps, the baseline actions were chosen (blue arrows).
  • Figure 2: Reward and goal reaching rate for training with DDPG without shadow mode
  • Figure 3: Test reward and goal reaching rate for training with DDPG in shadow mode, $\pi^c_{reg}$ regularized with regularization penalty \ref{eq:reg-action} with strength $\lambda = 2$, and control authority by agent. Both combined policies outperform the RL agent's policy $\pi^a$ that was not trained in shadow mode, even when it is trained with a dense reward and $\epsilon = 0.2$.
  • Figure 4: Reward (sparse) and goal reaching rate for training with DDPG in shadow mode and control authority by $Q$-value compared to standard training
  • Figure 5: Heatmap of the decision criterion $Q(s_t,a_t^a)/Q(s_t,a_t^b)$ for all possible positions of the agent with the given obstacle (red bar) and the target (green dot, $\epsilon = 0.02$, with green circle added for visibility). The baseline acts wherever the value is less than or equal to 1, and the agent acts in all other regions.
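
The control authority by $Q$-value described in the captions of Figures 1 and 5 can be illustrated with a minimal sketch. It is not the authors' implementation: the helper `select_action`, the toy policies, and the toy critic `toy_q` are all hypothetical names introduced here, and the sketch assumes strictly positive $Q$-values so that the ratio $Q(s_t,a_t^a)/Q(s_t,a_t^b)$ from Figure 5 is well defined.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def select_action(
    state: Vector,
    agent_policy: Callable[[Vector], Vector],
    baseline_policy: Callable[[Vector], Vector],
    q_function: Callable[[Vector, Vector], float],
) -> Tuple[Vector, str]:
    """Hypothetical helper: choose which action to execute via the Q-value ratio."""
    a_agent = agent_policy(state)      # a^a_t proposed by pi^a
    a_base = baseline_policy(state)    # a^b_t proposed by pi^b
    q_agent = q_function(state, a_agent)
    q_base = q_function(state, a_base)
    # Assumption of this sketch: the ratio is only meaningful for positive Q-values.
    if q_agent <= 0.0 or q_base <= 0.0:
        raise ValueError("this sketch assumes strictly positive Q-values")
    ratio = q_agent / q_base
    # As in the Figure 5 caption: the baseline acts for ratio <= 1, the agent otherwise.
    if ratio > 1.0:
        return a_agent, "agent"
    return a_base, "baseline"


# Toy stand-ins, assumptions for illustration only (not from the paper).
GOAL = (1.0, 1.0)


def toy_q(state: Vector, action: Vector) -> float:
    """Toy critic: higher value the closer the next position is to the goal."""
    next_x = state[0] + action[0]
    next_y = state[1] + action[1]
    return 1.0 / (1.0 + math.hypot(GOAL[0] - next_x, GOAL[1] - next_y))


def toy_agent(state: Vector) -> Vector:
    return (0.1, 0.1)  # steps diagonally towards the goal


def toy_baseline(state: Vector) -> Vector:
    return (0.1, 0.0)  # steps only along the x-axis


action, source = select_action((0.0, 0.0), toy_agent, toy_baseline, toy_q)
print(source, action)
```

In this toy setting the agent's action ends closer to the goal, so its $Q$-value is higher and the agent's action is the one executed. When both policies propose the same action, the ratio equals 1 and control stays with the baseline, matching the tie-breaking stated in the Figure 5 caption.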