BraVE: Offline Reinforcement Learning for Discrete Combinatorial Action Spaces

Matthew Landers, Taylor W. Killian, Hugo Barnes, Thomas Hartvigsen, Afsaneh Doryab

TL;DR

BraVE tackles offline reinforcement learning in high-dimensional combinatorial action spaces by imposing a tree structure on the action space and using a neural network to guide Q-value evaluation along it, so each decision requires only a linear number of evaluations while sub-action dependencies are preserved. It pairs a behavior-regularized TD loss with a recursive BraVE loss that propagates targets up the tree, and uses a depth penalty and beam search to stabilize training and improve policy quality. On the CoNE benchmark, the method is robust to growth in both action space size and sub-action dependencies, and it outperforms state-of-the-art baselines by up to 20× in large-action settings. BraVE also supports online fine-tuning and sheds light on the tradeoff between expressivity and tractability, offering a scalable approach to constrained RL and guidance for future work on combinatorial action spaces with offline data.

Abstract

Offline reinforcement learning in high-dimensional, discrete action spaces is challenging due to the exponential scaling of the joint action space with the number of sub-actions and the complexity of modeling sub-action dependencies. Existing methods either exhaustively evaluate the action space, making them computationally infeasible, or factorize Q-values, failing to represent joint sub-action effects. We propose Branch Value Estimation (BraVE), a value-based method that uses tree-structured action traversal to evaluate a linear number of joint actions while preserving dependency structure. BraVE outperforms prior offline RL methods by up to $20\times$ in environments with over four million actions.
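To make the scaling argument concrete, the following minimal sketch compares the size of a joint action space with a budget that grows linearly in the number of sub-actions. The setting of 22 binary sub-actions is an assumption chosen only because $2^{22}$ exceeds four million; it is not taken from the paper's environments, and `linear_budget` is an illustrative stand-in rather than BraVE's exact evaluation count.

```python
import math

# Hypothetical setting: 22 binary sub-actions (not a configuration from the paper).
sub_action_sizes = [2] * 22

# Exhaustive evaluation must score every joint action: the product of the sizes.
joint_actions = math.prod(sub_action_sizes)

# Assumed stand-in for a budget that is linear in the number of sub-actions.
linear_budget = sum(sub_action_sizes)

print(joint_actions)  # 4194304
print(linear_budget)  # 44
```

Adding one more binary sub-action doubles `joint_actions` but adds only 2 to `linear_budget`, which is the gap the abstract refers to.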

Paper Structure

This paper contains 35 sections, 11 equations, 14 figures, 5 tables, 1 algorithm.

Figures (14)

  • Figure 1: BraVE's tree representation for a 3-dimensional combinatorial action $\mathbf{a} = [a_1, a_2, a_3]$, where each sub-action $a_i \in \{0,1,2\}$. Each node encodes a complete action vector, with explicitly chosen sub-actions set according to the traversal path from the root, and all remaining dimensions filled with a default value (here, 0). At depth $k$, the value of sub-action $a_k$ is selected, with sibling nodes differing only in that dimension.
  • Figure 2: BraVE traversal in a 3-D binary action space (full tree shown bottom-right). Starting from the root $[0,0,0]$, the agent selects $\hat{a}'_1 = 1$ because its branch value ($11$) exceeds both the branch values of the alternative children ($4$, $-1$) and the root's Q-value ($8$). Traversal proceeds until reaching $[1,1,0]$, where the Q-value ($16$) exceeds the child's branch value ($1$), which is the terminal condition. Masked values ($-$) are ignored.
  • Figure 3: Example of loss propagation in a 4-D binary action space (full tree shown bottom-right). Starting from the node $[1,1,1,0]$ (bottom left), the target (Equation \ref{eq:target}) is propagated to its parent $[1,1,0,0]$. The new target is computed as the maximum of the propagated value, the parent's own Q-value, and the branch values of alternative child nodes. This process recurses up the tree to compute all node losses.
  • Figure 4: A 2-D grid with five pits and the true maximum Q-values in each state.
  • Figure 5: Learning curves for BraVE, FAS, and IQL in environments with non-factorizable reward structures but no sub-action dependencies.
  • ...and 9 more figures
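
The traversal rule described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `children` rule (switch on one dimension to the right of the last dimension already switched on) is an assumption inferred from the caption's node counts, the lookup tables `q` and `branch_value` stand in for network outputs, and the values at the intermediate node $[1,0,0]$ are hypothetical because the caption does not give them.

```python
from typing import Dict, List, Tuple

Action = Tuple[int, ...]

def children(node: Action) -> List[Action]:
    """Assumed child rule for a binary space: switch on one dimension
    to the right of the last dimension that is already switched on."""
    last_on = max((i for i, v in enumerate(node) if v == 1), default=-1)
    return [node[:i] + (1,) + node[i + 1:] for i in range(last_on + 1, len(node))]

def traverse(root: Action, q: Dict[Action, float],
             branch_value: Dict[Action, float]) -> List[Action]:
    """Descend while the best child's branch value exceeds the current
    node's own Q-value; stop otherwise. Returns the visited path."""
    path = [root]
    node = root
    while True:
        kids = children(node)
        if not kids:
            return path
        best = max(kids, key=lambda c: branch_value[c])
        if q[node] >= branch_value[best]:
            return path
        node = best
        path.append(node)

# Values from the Figure 2 caption, plus hypothetical ones where marked.
q = {
    (0, 0, 0): 8.0,
    (1, 0, 0): 5.0,   # hypothetical
    (1, 1, 0): 16.0,
}
branch_value = {
    (1, 0, 0): 11.0,
    (0, 1, 0): 4.0,
    (0, 0, 1): -1.0,
    (1, 1, 0): 12.0,  # hypothetical
    (1, 0, 1): 3.0,   # hypothetical
    (1, 1, 1): 1.0,
}

path = traverse((0, 0, 0), q, branch_value)
print(path[-1])  # (1, 1, 0)
```

Under these assumptions the traversal looks up three Q-values and six branch values to pick an action out of eight, and each step only scores the children of the current node rather than the whole subtree.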