Representative Action Selection for Large Action Space: From Bandits to MDPs

Quan Zhou, Shie Mannor

TL;DR

The paper tackles the challenge of learning under an extremely large, shared action space by identifying a small, representative subset of actions that preserves near-optimal performance across a family of RL environments. It extends a meta-bandit framework to meta-MDPs, modeling state-action values with fixed feature mappings and (non-centered) sub-Gaussian processes to capture environment heterogeneity. The core ideas hinge on constructing measure-theoretic epsilon-nets and analyzing a reference action set via cluster regret and Gaussian-width bounds, with extensions to large state spaces through state abstraction and relaxed optimality conditions. Theoretical results quantify how the representative-action approach approximates full-space performance and illustrate practical viability through IID-action meta-bandits, tabular MDPs, and a CartPole case study, highlighting gains in efficiency and stability. Overall, the work provides a principled, scalable framework for large-scale, structured RL problems where exhaustive action evaluation is intractable.

Abstract

We study the problem of selecting a small, representative action subset from an extremely large action space shared across a family of reinforcement learning (RL) environments -- a fundamental challenge in applications like inventory management and recommendation systems, where direct learning over the entire space is intractable. Our goal is to identify a fixed subset of actions that, for every environment in the family, contains a near-optimal action, thereby enabling efficient learning without exhaustively evaluating all actions. This work extends our prior results for meta-bandits to the more general setting of Markov Decision Processes (MDPs). We prove that our existing algorithm achieves performance comparable to using the full action space. This theoretical guarantee is established under a relaxed, non-centered sub-Gaussian process model, which accommodates greater environmental heterogeneity. Consequently, our approach provides a computationally and sample-efficient solution for large-scale combinatorial decision-making under uncertainty.
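The goal stated in the abstract, a fixed subset that contains a near-optimal action for every environment in the family, can be made concrete with a toy check. The sketch below is illustrative only and is not the paper's algorithm: the environment family (`make_env`), the sampling rule in `sample_representatives` (keep the best action of each of `k` sampled environments), and all parameter values are assumptions chosen for the example.

```python
import random


def subset_loss(rewards, subset):
    """Worst-case gap, over environments, between the best action overall
    and the best action available inside `subset`."""
    subset = list(subset)
    return max(max(env) - max(env[a] for a in subset) for env in rewards)


def sample_representatives(sample_env, k, rng):
    """Hypothetical construction (an assumption, not the paper's method):
    draw k environments and keep the best action of each one."""
    chosen = set()
    for _ in range(k):
        env = sample_env(rng)
        chosen.add(max(range(len(env)), key=env.__getitem__))
    return sorted(chosen)


def make_env(rng, n_actions=200, n_clusters=4):
    """Toy environment family (assumed): each environment favors one of a
    few action clusters, with small Gaussian noise on every reward."""
    center = rng.randrange(n_clusters)
    return [
        (1.0 if a % n_clusters == center else 0.0) + rng.gauss(0.0, 0.1)
        for a in range(n_actions)
    ]


rng = random.Random(0)
reps = sample_representatives(make_env, k=10, rng=rng)
test_envs = [make_env(rng) for _ in range(100)]

loss_subset = subset_loss(test_envs, reps)       # gap when restricted to reps
loss_full = subset_loss(test_envs, range(200))   # zero by construction
print(len(reps), loss_subset, loss_full)
```

A small `loss_subset` with `len(reps)` much smaller than the number of actions is the behavior the abstract aims for; in this toy family it depends on whether the sampled environments happen to cover every cluster.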

Paper Structure

This paper contains 52 sections, 17 theorems, 136 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Lemma 4.1

Given an initial state $x_0$ and a policy $\pi$, the difference in value functions from the optimal policy $\pi^*$ satisfies

Figures (3)

  • Figure 1: Comparison between the exact expected regret of Algorithm \ref{alg:valuefunction} (solid curve) and the theoretical upper bounds (dashed curves) from Theorem \ref{thm:alg-bound-upper}, for $n = 10$ actions (i.i.d. from $\mathcal{N}(0,1)$) and different numbers of clusters $m \in \{2, 5, 10\}$. The bounds and exact values are plotted as functions of the sample size $K = 1, 2, \ldots, 50$.
  • Figure 2: Comparison of the exact expected performance loss (left) with the upper bound from Theorem \ref{thm:alg-bound-upper} (right) for different sample sizes $K$ and concentration parameters $\rho$. The figure illustrates how the performance loss and bound evolve as $\rho$ varies: when either $\rho$ or $1-\rho$ is small, the loss decreases quickly at first but approaches zero more slowly, whereas for nearly uniform $\rho$, the decrease is slower at first but reaches zero faster.
  • Figure 3: Performance comparison between the full action space (501 actions) and action sets constructed via Algorithm \ref{alg:valuefunction} on a family of CartPole environments. Left: Training runtime for policies using the available actions. Right: Total reward earned by deployed policies from a fixed initial state. Results show mean $\pm$ one standard deviation across 30 repetitions for $K \in \{3, 5, 8, 10, 12, 15, 20\}$.

Theorems & Definitions (38)

  • Example 2.1
  • Lemma 4.1: Performance Difference Lemma (Kakade and Langford, 2002)
  • Lemma 4.2
  • proof
  • Definition 4.1: Reference Action Set
  • Lemma 4.3
  • Theorem 5.1
  • Lemma 5.1: Borell-TIS inequality
  • Lemma 5.2: Expectation integral identity
  • proof: Proof of Theorem \ref{thm:geometric-bounds}
  • ...and 28 more