An Advantage-based Optimization Method for Reinforcement Learning in Large Action Space

Hai Lin, Cheng Huang, Zhihong Chen

TL;DR

The paper tackles reinforcement learning in large action spaces with an advantage-based optimization method, implemented as the Advantage Branching Dueling Q-network (ABQ). ABQ decomposes actions into branches and uses a Baseline Module to tune branch advantages via the relation $Q(s,a) = V(s) + (A(s,a) - B)$, where $B = \max_{1 \le i \le n} \frac{1}{N} \sum_{j=1}^N A_i(s,a_{ij})$, improving coordination across branches and learning efficiency. The method reduces the action evaluation burden to $n \times N$ (for $n$ branches with $N$ sub-actions each) while preserving global performance through the baseline adjustment, and is implemented as the ABQ network with a dueling architecture and per-branch Q-learning updates. Empirical results on Gym benchmarks (Pendulum, BipedalWalker, HalfCheetah, Ant, Humanoid) show ABQ outperforming BDQ and performing competitively with DDPG and TD3, particularly as action-space dimensionality grows. These results suggest ABQ’s potential for efficient, scalable RL in real-world, high-dimensional action settings.
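
To make the baseline relation concrete, the following is a minimal NumPy sketch of one way to read it, not the authors' implementation. It assumes the relation is applied per branch, so that $Q_i(s,a_{ij}) = V(s) + (A_i(s,a_{ij}) - B)$ with a single $B$ shared by all branches. The function names (`abq_branch_q_values`, `greedy_sub_actions`) and the toy numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def abq_branch_q_values(state_value, branch_advantages):
    """Combine V(s) with per-branch advantages using a shared baseline B.

    branch_advantages has shape (n, N): n branches, N sub-actions per branch.
    This per-branch reading of Q = V + (A - B) is an assumption of the sketch.
    """
    adv = np.asarray(branch_advantages, dtype=float)
    # Mean advantage of each branch: (1/N) * sum_j A_i(s, a_ij)
    branch_means = adv.mean(axis=1)
    # Baseline B: the largest of the n branch means
    baseline = branch_means.max()
    # Q_i(s, a_ij) = V(s) + (A_i(s, a_ij) - B), broadcast over all entries
    q_values = state_value + (adv - baseline)
    return q_values, baseline


def greedy_sub_actions(q_values):
    """Pick the highest-valued sub-action independently in each branch."""
    return q_values.argmax(axis=1)


# Toy example (hypothetical numbers): n = 2 branches, N = 3 sub-actions each.
advantages = np.array([[1.0, 2.0, 3.0],
                       [4.0, 0.0, 5.0]])
q, b = abq_branch_q_values(0.5, advantages)
actions = greedy_sub_actions(q)

n, N = advantages.shape
branch_outputs = n * N   # values a branching network evaluates
joint_actions = N ** n   # values needed to enumerate every joint action
```

In this toy case the branch means are 2.0 and 3.0, so $B = 3.0$. The last two lines illustrate the scaling argument: a branching network evaluates $n \times N$ values, whereas enumerating every combination of sub-actions would require $N^n$.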

Abstract

Reinforcement learning tasks in real-world scenarios often involve large, high-dimensional action spaces, leading to challenges such as convergence difficulties, instability, and high computational complexity. It is widely acknowledged that traditional value-based reinforcement learning algorithms struggle to address these issues effectively. A prevalent approach involves generating independent sub-actions within each dimension of the action space. However, this method introduces bias, hindering the learning of optimal policies. In this paper, we propose an advantage-based optimization method and an algorithm named Advantage Branching Dueling Q-network (ABQ). ABQ incorporates a baseline mechanism to tune the action value of each dimension, leveraging the advantage relationship across different sub-actions. With this approach, the learned policy can be optimized for each dimension. Empirical results demonstrate that ABQ outperforms BDQ, achieving 3%, 171%, and 84% more cumulative rewards in HalfCheetah, Ant, and Humanoid environments, respectively. Furthermore, ABQ exhibits competitive performance when compared against two continuous action benchmark algorithms, DDPG and TD3.

Paper Structure

This paper contains 8 sections, 8 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: An example of Action Branching Architecture
  • Figure 2: Advantage Branching Dueling Q-Network
  • Figure 3: Performance analysis in the environment of Pendulum
  • Figure 4: Performance analysis in the environment of BipedalWalker
  • Figure 5: Performance analysis in the environment of HalfCheetah
  • ...and 2 more figures