Deep Reinforcement Learning in Parameterized Action Space

Matthew Hausknecht, Peter Stone

TL;DR

This work extends the Deep Deterministic Policy Gradients (DDPG) framework to parameterized action spaces by introducing a method for bounding action-space gradients, which enables stable learning with bounded continuous actions. Using the RoboCup 2D Half Field Offense domain, the authors train an actor-critic network to simultaneously select a discrete action type (Dash, Turn, Tackle, Kick) and its continuous parameters, learning goal-directed behavior from scratch. Of the three gradient-bounding strategies compared (inverting, zeroing, and squashing gradients), only inverting gradients learns the soccer task. Multiple agents learn to approach the ball, kick it toward the goal, and score, with several surpassing the strong hand-coded Helios champion agent and outperforming a SARSA baseline. The findings demonstrate that deep reinforcement learning is viable in parameterized action spaces and provide a technique for bounded continuous actions that applies beyond RoboCup.

Abstract

Recent work has shown that deep neural networks are capable of approximating both value functions and policies in reinforcement learning domains featuring continuous state and action spaces. However, to the best of our knowledge no previous work has succeeded at using deep neural networks in structured (parameterized) continuous action spaces. To fill this gap, this paper focuses on learning within the domain of simulated RoboCup soccer, which features a small set of discrete action types, each of which is parameterized with continuous variables. The best learned agent can score goals more reliably than the 2012 RoboCup champion agent. As such, this paper represents a successful extension of deep reinforcement learning to the class of parameterized action space MDPs.
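To make the idea of a parameterized action concrete, the following is a minimal sketch of how a flat actor output could be decoded into one discrete action type plus its continuous parameters. The four action types come from the text above; the parameter names, their bounds, the output layout, and the `select_action` helper are assumptions made for illustration, not details confirmed by this summary. The clamping step is likewise only an illustrative safeguard.

```python
# Assumed parameter list per action type (illustrative, not taken from the summary).
ACTION_PARAMS = {
    "Dash": ("power", "direction"),
    "Turn": ("direction",),
    "Tackle": ("direction",),
    "Kick": ("power", "direction"),
}

# Assumed bounds for each kind of parameter.
PARAM_BOUNDS = {"power": (0.0, 100.0), "direction": (-180.0, 180.0)}


def select_action(actor_output):
    """Decode a flat actor output into (action_type, parameters).

    Assumed layout: one score per action type, followed by the
    parameters of every action type in ACTION_PARAMS order.
    """
    names = list(ACTION_PARAMS)
    n_types = len(names)
    scores = actor_output[:n_types]
    params = actor_output[n_types:]

    expected = sum(len(p) for p in ACTION_PARAMS.values())
    if len(scores) != n_types or len(params) != expected:
        raise ValueError("actor output has the wrong length")

    # Pick the action type with the highest score.
    best = max(range(n_types), key=lambda i: scores[i])
    chosen = names[best]

    # Skip over the parameters that belong to earlier action types.
    offset = sum(len(ACTION_PARAMS[names[i]]) for i in range(best))

    values = {}
    for j, pname in enumerate(ACTION_PARAMS[chosen]):
        low, high = PARAM_BOUNDS[pname]
        raw = params[offset + j]
        values[pname] = min(max(raw, low), high)  # clamp into the valid range
    return chosen, values


if __name__ == "__main__":
    output = [0.1, 0.2, -0.3, 0.9, 50, 10, 20, 30, 120, -200]
    print(select_action(output))
```

In this sketch only the parameters of the selected action type are used; the others are ignored for that step. Because the actor can emit values outside the valid range (120 and -200 in the usage example), some mechanism is needed to keep parameters bounded during learning, which is what the gradient-bounding strategies address.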

Paper Structure

This paper contains 16 sections, 11 equations, 4 figures.

Figures (4)

  • Figure 1: Left: HFO State Representation uses a low-level, egocentric viewpoint providing features such as distances and angles to objects of interest like the ball, goal posts, corners of the field, and opponents. Right: Helios handcoded policy scores on a goalie. This 2012 champion agent forms a natural (albeit difficult) baseline of comparison.
  • Figure 2: Actor-Critic architecture (left): actor and critic networks may be interlinked, allowing activations to flow forwards from the actor to the critic and gradients to flow backwards from the critic to the actor. The gradients coming from the critic indicate directions of improvement in the continuous action space and are used to train the actor network without explicit targets. Actor Update (right): Backwards pass generates critic gradients $\nabla_{a}Q(s,a|\theta^Q)$ w.r.t. the action. These gradients are back-propagated through the actor resulting in gradients w.r.t. parameters $\nabla_{\theta^\mu}$ which are used to update the actor. Critic gradients w.r.t. parameters $\nabla_{\theta^Q}$ are ignored during the actor update.
  • Figure 3: Analysis of gradient bounding strategies: The left/middle/right columns respectively correspond to the inverting/zeroing/squashing gradients approaches to handling bounded continuous actions. First row depicts learning curves showing overall task performance: Only the inverting gradient approach succeeds in learning the soccer task. Second row shows average Q-Values produced by the critic throughout the entire learning process: Inverting gradient approach shows smoothly increasing Q-Values. The zeroing approach shows astronomically high Q-Values indicating instability in the critic. The squashing approach shows stable Q-Values that accurately reflect the actor's performance. Third row shows the average loss experienced during a critic update (the critic update equation): As more reward is experienced critic loss is expected to rise as past actions are seen as increasingly sub-optimal. Inverting gradients shows growing critic loss with outliers accounting for the rapid increase nearing the right edge of the graph. Zeroing gradients approach shows unstably large loss. Squashing gradients never discovers much reward and loss stays near zero.
  • Figure 4: Left: Scatter plot of learning curves of DDPG-agents with Lowess curve. Three distinct phases of learning may be seen: the agents first get small rewards for approaching the ball (episode 1,500), then learn to kick the ball towards the goal (episodes 2,000 to 8,000), and start scoring goals around episode 10,000. Right: DDPG-agents score nearly as reliably as the expert baseline, but take longer to do so. A video of DDPG$_1$'s policy may be viewed at https://youtu.be/Ln0Cl-jE_40.
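
The three gradient-bounding strategies compared in Figure 3 can be sketched for a single bounded parameter $p \in [p_{min}, p_{max}]$ as follows. This is a minimal illustration, not the authors' implementation: the function names, the sign convention (a positive gradient means "increase $p$"), the tanh-based squashing, and the step size in the demo are all assumptions. The inverting rule scales a gradient by the remaining headroom in the direction it pushes, so the gradient shrinks near a bound and flips sign once the parameter exceeds it.

```python
import math


def invert_gradient(grad, p, p_min, p_max):
    """Inverting gradients: scale by headroom, flip sign when out of bounds."""
    width = p_max - p_min
    if grad > 0:  # gradient suggests increasing p
        return grad * (p_max - p) / width
    return grad * (p - p_min) / width  # gradient suggests decreasing p


def zero_gradient(grad, p, p_min, p_max):
    """Zeroing gradients (one simple reading): pass the gradient inside
    the bounds and zero it otherwise."""
    return grad if p_min < p < p_max else 0.0


def squash(raw, p_min, p_max):
    """Squashing: map an unbounded activation into [p_min, p_max] with tanh."""
    return p_min + (p_max - p_min) * (math.tanh(raw) + 1.0) / 2.0


def ascend(p, grad, p_min, p_max, step_size=5.0, steps=200):
    """Repeatedly apply a constant upward gradient through the inverting rule."""
    for _ in range(steps):
        p += step_size * invert_gradient(grad, p, p_min, p_max)
    return p


if __name__ == "__main__":
    # Inside the bounds the gradient is scaled down; outside it is inverted.
    print(invert_gradient(2.0, 50.0, 0.0, 100.0))
    print(invert_gradient(2.0, 110.0, 0.0, 100.0))
    # A critic that always asks for "more" cannot push p past p_max.
    print(ascend(90.0, 1.0, 0.0, 100.0))
```

In the `ascend` demo, a gradient that always points upward drives the parameter toward its maximum without crossing it, because each update is proportional to the distance still available. The zeroing rule offers no such smooth slowdown, and squashing keeps the output bounded by construction rather than by altering gradients.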