An Advantage-based Optimization Method for Reinforcement Learning in Large Action Space
Hai Lin, Cheng Huang, Zhihong Chen
TL;DR
The paper tackles reinforcement learning in large action spaces by introducing an advantage-based optimization method, implemented as the Advantage Branching Dueling Q-network (ABQ). ABQ decomposes the action into $n$ branches of $N$ sub-actions each and uses a Baseline Module to tune branch advantages via the relation $Q(s,a) = V(s) + (A(s,a) - B)$, where $B = \max_{1 \le i \le n} \frac{1}{N} \sum_{j=1}^N A_i(s,a_{ij})$ is the largest per-branch mean advantage, improving coordination and learning efficiency. The method reduces the number of action evaluations to $n \times N$ while the baseline adjustment preserves global performance, and it is implemented with a dueling architecture and per-branch Q-learning updates. Empirical results on Gym benchmarks (Pendulum, BipedalWalker, HalfCheetah, Ant, Humanoid) show ABQ outperforming BDQ and remaining competitive with DDPG and TD3, particularly as action-space dimensionality grows. These results suggest ABQ’s potential for efficient, scalable RL in real-world, high-dimensional action settings.
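To make the baseline relation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the advantages are available as an array with one row per branch and one column per sub-action, and that the relation is applied to every branch separately. The function name `abq_q_values` and the numbers are hypothetical.

```python
import numpy as np

def abq_q_values(state_value, branch_advantages):
    """Combine V(s) and per-branch advantages into per-branch Q-values.

    branch_advantages has shape (n, N): n branches, N sub-actions each.
    Applying the relation branch by branch is an assumption of this sketch.
    """
    adv = np.asarray(branch_advantages, dtype=float)
    # Baseline B: the largest of the n per-branch mean advantages.
    baseline = adv.mean(axis=1).max()
    # Q_i(s, a_ij) = V(s) + (A_i(s, a_ij) - B), broadcast over all branches.
    return state_value + (adv - baseline), baseline

# Hypothetical advantages for n = 2 branches with N = 3 sub-actions each.
advantages = np.array([[3.0, 1.0, 2.0],
                       [0.0, 1.0, 8.0]])
q_values, baseline = abq_q_values(0.5, advantages)

# Greedy choice: one sub-action index per branch.
greedy_sub_actions = q_values.argmax(axis=1)
```

Because the same scalar $B$ is subtracted everywhere, the greedy sub-action within each branch is unchanged. What the baseline shifts is the magnitude of the Q-values used as learning targets.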
Abstract
Reinforcement learning tasks in real-world scenarios often involve large, high-dimensional action spaces, leading to challenges such as convergence difficulties, instability, and high computational complexity. Traditional value-based reinforcement learning algorithms are widely acknowledged to struggle with these issues. A prevalent approach generates independent sub-actions within each dimension of the action space, but this introduces bias that hinders the learning of optimal policies. In this paper, we propose an advantage-based optimization method and an algorithm named Advantage Branching Dueling Q-network (ABQ). ABQ incorporates a baseline mechanism to tune the action value of each dimension, leveraging the advantage relationship across different sub-actions. With this approach, the learned policy can be optimized for each dimension. Empirical results demonstrate that ABQ outperforms BDQ, achieving 3%, 171%, and 84% higher cumulative reward in the HalfCheetah, Ant, and Humanoid environments, respectively. Furthermore, ABQ exhibits competitive performance when compared against two continuous-action benchmark algorithms, DDPG and TD3.
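As an illustration of generating sub-actions per dimension, the sketch below discretizes each action dimension into a fixed number of evenly spaced values and compares the size of the joint action set with the number of branch outputs. The uniform grid, the bounds, the bin count, and the helper names `sub_action_grid` and `to_continuous` are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def sub_action_grid(low, high, bins_per_dim):
    """Return an (n, N) grid; row i holds the N candidate values of dimension i."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    steps = np.linspace(0.0, 1.0, bins_per_dim)           # shape (N,)
    return low[:, None] + (high - low)[:, None] * steps   # shape (n, N)

def to_continuous(grid, sub_action_indices):
    """Pick one value per dimension to form a continuous action vector."""
    return grid[np.arange(grid.shape[0]), sub_action_indices]

n_dims, n_bins = 3, 5   # hypothetical: n = 3 dimensions, N = 5 sub-actions each
grid = sub_action_grid([-1.0] * n_dims, [1.0] * n_dims, n_bins)
action = to_continuous(grid, np.array([0, 2, 4]))

joint_actions = n_bins ** n_dims   # every combination of sub-actions: N^n
branch_outputs = n_dims * n_bins   # one value per sub-action per branch: n * N
```

With these hypothetical sizes the joint set has 125 actions, while a branching network only needs 15 outputs. The gap widens exponentially as the number of dimensions grows.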
