Reinforcement learning with combinatorial actions for coupled restless bandits

Lily Xu, Bryan Wilder, Elias B. Khalil, Milind Tambe

TL;DR

This work tackles RL planning when actions are inherently combinatorial and coupled across arms, a setting difficult for traditional RL. It introduces SEQUOIA, a framework that learns a Q-function via deep Q-learning and uses an MILP to optimally select feasible combinatorial actions at each timestep by embedding the trained Q-network. The paper formalizes four novel coRMAB problem instances (multiple interventions, bipartite scheduling, capacity constraints, and path constraints) and demonstrates substantial performance gains over myopic and heuristic baselines across challenging instances. Computational challenges are addressed through warm-starting, variance reduction, and MILP-based action selection, enabling effective long-horizon planning in complex action spaces. The approach broadens RL applicability to stochastic planning problems with per-step combinatorial actions and offers a path toward integrating RL with AI planning frameworks, albeit with notable MILP-solving overhead that motivates future efficiency improvements.

Abstract

Reinforcement learning (RL) has increasingly been applied to solve real-world planning problems, with progress in handling large state spaces and time horizons. However, a key bottleneck in many domains is that RL methods cannot accommodate large, combinatorially structured action spaces. In such settings, even representing the set of feasible actions at a single step may require a complex discrete optimization formulation. We leverage recent advances in embedding trained neural networks into optimization problems to propose SEQUOIA, an RL algorithm that directly optimizes for long-term reward over the feasible action space. Our approach embeds a Q-network into a mixed-integer program to select a combinatorial action in each timestep. Here, we focus on planning over restless bandits, a class of planning problems which capture many real-world examples of sequential decision making. We introduce coRMAB, a broader class of restless bandits with combinatorial actions that cannot be decoupled across the arms of the restless bandit, requiring direct solving over the joint, exponentially large action space. We empirically validate SEQUOIA on four novel restless bandit problems with combinatorial constraints: multiple interventions, path constraints, bipartite matching, and capacity constraints. Our approach significantly outperforms existing methods, which cannot address sequential planning and combinatorial selection simultaneously, by an average of 24.8% on these difficult instances.
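The per-timestep action selection described in the abstract can be written out as a mixed-integer program. The block below is a sketch under assumptions, not the paper's stated formulation: it assumes a feed-forward Q-network with $K$ hidden ReLU layers and uses the standard big-M encoding of each ReLU. The symbols $W^{k}$, $b^{k}$, $w$, $c$ (network weights), $x^{k}$ (layer activations), $z^{k}_i$ (binary indicators), $L^{k}_i \le 0 \le U^{k}_i$ (precomputed bounds on each pre-activation), and $\mathcal{A}(\bm{s})$ (the feasible action set) are all introduced here for illustration.

```latex
\begin{align*}
\max_{\bm{a},\, x,\, z} \quad & w^\top x^{K} + c
  && \text{(network output, i.e. } Q(\bm{s}, \bm{a}) \text{)} \\
\text{s.t.} \quad & \bm{a} \in \mathcal{A}(\bm{s})
  && \text{(combinatorial action constraints)} \\
& x^{0} = (\bm{s}, \bm{a})
  && \text{(} \bm{s} \text{ is fixed data, } \bm{a} \text{ is a decision variable)} \\
& x^{k}_i \ge W^{k}_i x^{k-1} + b^{k}_i, \quad x^{k}_i \ge 0
  && \forall k = 1, \dots, K, \ \forall i \\
& x^{k}_i \le W^{k}_i x^{k-1} + b^{k}_i - L^{k}_i \, (1 - z^{k}_i)
  && \forall k, i \\
& x^{k}_i \le U^{k}_i \, z^{k}_i, \quad z^{k}_i \in \{0, 1\}
  && \forall k, i
\end{align*}
```

When $z^{k}_i = 1$ the constraints force $x^{k}_i$ to equal the pre-activation, and when $z^{k}_i = 0$ they force $x^{k}_i = 0$, so every feasible point reproduces a forward pass of the network on the chosen action. Tighter bounds $L^{k}_i$ and $U^{k}_i$ generally give a stronger relaxation for the solver.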

Paper Structure

This paper contains 48 sections, 9 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Restless bandits are a variant of planning over Markov decision processes, where each "arm" transitions depending on whether it is acted upon. Standard restless bandits can be solved with threshold-based policies, but these approaches are unable to address challenging settings with combinatorial constraints on the actions that cannot be decoupled, prohibiting the application of easy heuristic solutions. Our paper considers this class of strongly coupled restless bandits (coRMAB). We describe these new problem formulations (A--D) for restless bandits in detail in \ref{sec:settings}.
  • Figure 2: An overview of our SEQUOIA algorithm. Standard DQN takes as input the state and outputs estimated Q-values for actions, which are assumed to be easily enumerable. In contrast, we consider cases where the action space is too large to enumerate due to its combinatorial constraint structure. Part 1. We therefore train a Q-network where the action $\bm{a}$ is included as an input (\ref{alg:algorithm}; described in \ref{sec:q_learning}). Part 2. We then embed that Q-network into a mixed-integer program, which also specifies the combinatorial action constraints (e.g., the formulations provided in \ref{sec:settings}). In evaluating the objective, the MILP solver conducts a forward pass through the neural network in order to calculate the expected Q-value. Solving the MILP thus finds an action $\bm{a}$ (the decision variables) that maximizes the predicted Q-value $Q(\bm{s}, \bm{a})$ (our objective).
  • Figure 3: Across all problem settings, SEQUOIA achieves consistently better performance compared to existing methods, which do not consider both combinatorial selection and sequential planning. We evaluate with $J=\{20, 40, 100\}$ arms and $N = \{5, 10, 20\}$ workers. The $y$-axis depicts the average per-timestep reward, normalized to the reward achieved by the Random baseline such that $R_{\textsc{Random}} = 1$. For the path-constrained problem, note that there are no Iterative Myopic or Iterative DQN baselines, as there is no simple iterative approach for selecting a valid cycle.
  • Figure 4: Even in a simple two-arm problem setting with budget $B=1$, a myopic policy can lead to arbitrarily poor performance in restless bandits. Each arm has three states, with positive reward in each state. Suppose that the probability $p$ of transitioning right one state is $p=1$ when the arm is acted on and $p=0$ otherwise. This problem instance leads to the rewards on the right, where the gap between myopic and SEQUOIA can be arbitrarily large depending on the rewards at the rightmost state.
  • Figure 5: Graph used for the path-constrained problem, based on a diagram of the London underground. Blue represents nodes used for $J=20$. Green nodes are added for $J=40$, and orange nodes are added for $J=100$.
  • ...and 1 more figures
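
The two-arm setting in the Figure 4 caption can be simulated in a few lines. The sketch below is illustrative only: the reward values in `REWARDS`, the horizon of 6, and the function names are assumptions chosen here, not numbers from the paper, and the exhaustive planner is a stand-in for a long-horizon policy rather than SEQUOIA itself. It follows the caption's dynamics: budget 1, three states per arm, and an arm moves right with probability 1 when acted on and stays put otherwise.

```python
from functools import lru_cache

# Hypothetical rewards (not from the paper): REWARDS[arm][state], all positive.
# Arm 1 dips before reaching a large payoff BIG in its rightmost state.
BIG = 100.0
REWARDS = ((1.0, 2.0, 3.0), (2.0, 1.0, BIG))
LAST = 2  # index of the rightmost state


def step(state, arm):
    """Acting on `arm` moves it one state right (p=1); the other arm stays (p=0)."""
    nxt = list(state)
    nxt[arm] = min(nxt[arm] + 1, LAST)
    return tuple(nxt)


def reward(state):
    """Total reward collected from both arms in the given joint state."""
    return sum(REWARDS[i][s] for i, s in enumerate(state))


def myopic_return(state, horizon):
    """Greedy policy: always pick the arm with the best immediate reward."""
    total = 0.0
    for _ in range(horizon):
        # Ties are broken in favour of arm 0 (first maximum wins).
        state = max((step(state, a) for a in (0, 1)), key=reward)
        total += reward(state)
    return total


@lru_cache(maxsize=None)
def optimal_return(state, horizon):
    """Exhaustive finite-horizon planning over both possible actions per step."""
    if horizon == 0:
        return 0.0
    return max(
        reward(step(state, a)) + optimal_return(step(state, a), horizon - 1)
        for a in (0, 1)
    )


if __name__ == "__main__":
    start = (0, 0)
    print("myopic :", myopic_return(start, 6))
    print("optimal:", optimal_return(start, 6))
```

With these assumed rewards, the myopic policy never acts on arm 1 because doing so lowers the immediate reward, so its return stays fixed while the planner's return grows with `BIG`. Raising `BIG` widens the gap without bound, which is the effect the caption describes.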