BraVE: Offline Reinforcement Learning for Discrete Combinatorial Action Spaces
Matthew Landers, Taylor W. Killian, Hugo Barnes, Thomas Hartvigsen, Afsaneh Doryab
TL;DR
BraVE addresses offline reinforcement learning in high-dimensional combinatorial action spaces by imposing a tree structure on the action space and using a neural network to guide traversal of that tree, so that each decision requires only a linear number of Q-value evaluations while sub-action dependencies are preserved. Training combines a behavior-regularized TD loss with a recursive BraVE loss that propagates targets through the tree; a depth penalty and beam search are added to stabilize training and improve policy quality. On the CoNE benchmark, BraVE remains robust as action space size and sub-action dependencies increase, and it outperforms state-of-the-art baselines by up to 20× in environments with over four million actions. BraVE also supports online fine-tuning, and the results shed light on the tradeoff between expressivity and tractability in combinatorial action spaces with offline data.
Abstract
Offline reinforcement learning in high-dimensional, discrete action spaces is challenging due to the exponential scaling of the joint action space with the number of sub-actions and the complexity of modeling sub-action dependencies. Existing methods either exhaustively evaluate the action space, making them computationally infeasible, or factorize Q-values, failing to represent joint sub-action effects. We propose Branch Value Estimation (BraVE), a value-based method that uses tree-structured action traversal to evaluate a linear number of joint actions while preserving dependency structure. BraVE outperforms prior offline RL methods by up to $20\times$ in environments with over four million actions.
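To make the scaling argument concrete, the following is a minimal Python sketch of value-guided tree traversal over binary sub-actions. It is an illustration under stated assumptions, not the authors' implementation: the tree layout (`children`), the toy value function (`toy_q`), and the brute-force `exact_evaluate`, which stands in for a learned network that returns a node's Q-value together with a value for each child branch, are all hypothetical. The toy Q-function includes an interaction term so that two sub-actions are each harmful alone but beneficial together, which is the kind of dependency a per-sub-action factorization would miss.

```python
from itertools import product
from typing import Callable, List, Tuple

Action = Tuple[int, ...]
N = 6  # number of binary sub-actions in this toy example (assumption)


def children(action: Action) -> List[Action]:
    """Each child switches on one sub-action to the right of the last active one,
    so every joint action appears exactly once in the tree."""
    last_on = max((i for i, bit in enumerate(action) if bit), default=-1)
    return [action[:i] + (1,) + action[i + 1:] for i in range(last_on + 1, len(action))]


def toy_q(action: Action) -> int:
    """Hypothetical Q-function: additive weights plus a joint bonus for sub-actions 1 and 3."""
    weights = (2, -1, 3, -2, 1, -4)
    bonus = 5 if action[1] and action[3] else 0
    return sum(w * a for w, a in zip(weights, action)) + bonus


def subtree_max(q: Callable[[Action], int], node: Action) -> int:
    """Best Q-value reachable at or below a node (brute force, for the toy only)."""
    return max([q(node)] + [subtree_max(q, c) for c in children(node)])


def exact_evaluate(node: Action) -> Tuple[int, List[int]]:
    """Stand-in for one network call: the node's Q-value and one value per child branch."""
    return toy_q(node), [subtree_max(toy_q, c) for c in children(node)]


def traverse(evaluate: Callable[[Action], Tuple[int, List[int]]], n: int) -> Tuple[Action, int]:
    """Descend from the all-zeros action while some branch promises more than the node."""
    node: Action = (0,) * n
    calls = 0
    while True:
        q_node, branch_values = evaluate(node)
        calls += 1
        kids = children(node)
        if not kids or max(branch_values) <= q_node:
            return node, calls
        node = kids[branch_values.index(max(branch_values))]


best_action, num_calls = traverse(exact_evaluate, N)
brute_force_best = max(product((0, 1), repeat=N), key=toy_q)

if __name__ == "__main__":
    print("joint actions:", 2 ** N)
    print("traversal result:", best_action, "Q =", toy_q(best_action), "calls =", num_calls)
    print("brute-force result:", brute_force_best, "Q =", toy_q(brute_force_best))
```

In this sketch the traversal makes at most N + 1 calls to `evaluate`, whereas exhaustive search scores all 2^N joint actions. As a point of reference, 22 binary sub-actions would give 2^22 = 4,194,304 joint actions. The traversal recovers the true optimum here only because the branch values are exact; with a learned approximation, the quality of the selected action depends on how accurately those branch values are estimated.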
