Representative Action Selection for Large Action Space: From Bandits to MDPs
Quan Zhou, Shie Mannor
TL;DR
The paper addresses learning over an extremely large action space shared across a family of RL environments by identifying a small, representative subset of actions that preserves near-optimal performance in every environment. It extends a meta-bandit framework to meta-MDPs, modeling state-action values with fixed feature mappings and (non-centered) sub-Gaussian processes to capture environment heterogeneity. The analysis constructs measure-theoretic epsilon-nets and studies a reference action set via cluster regret and Gaussian-width bounds, with extensions to large state spaces through state abstraction and relaxed optimality conditions. Theoretical results quantify how closely the representative-action approach approximates full-space performance; the approach is illustrated on IID-action meta-bandits, tabular MDPs, and a CartPole case study, which highlight gains in efficiency and stability. Overall, the work provides a principled, scalable framework for large-scale, structured RL problems where exhaustive action evaluation is intractable.
Abstract
We study the problem of selecting a small, representative action subset from an extremely large action space shared across a family of reinforcement learning (RL) environments -- a fundamental challenge in applications like inventory management and recommendation systems, where direct learning over the entire space is intractable. Our goal is to identify a fixed subset of actions that, for every environment in the family, contains a near-optimal action, thereby enabling efficient learning without exhaustively evaluating all actions. This work extends our prior results for meta-bandits to the more general setting of Markov Decision Processes (MDPs). We prove that our existing algorithm achieves performance comparable to using the full action space. This theoretical guarantee is established under a relaxed, non-centered sub-Gaussian process model, which accommodates greater environmental heterogeneity. Consequently, our approach provides a computationally and sample-efficient solution for large-scale combinatorial decision-making under uncertainty.
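To make the goal of a fixed subset that contains a near-optimal action for every environment more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a simplified setting that the abstract does not specify: each action has a unit-norm feature vector, each environment is a unit-norm parameter vector, and the reward is their inner product. It builds a greedy geometric epsilon-net in Euclidean feature space (a stand-in for the measure-theoretic epsilon-nets mentioned in the summary) and checks how much reward is lost by restricting to the net. The names `greedy_epsilon_net`, `features`, and `thetas` are hypothetical.

```python
import numpy as np

def greedy_epsilon_net(features, eps):
    """Return indices of a greedy eps-net: every row of `features` lies
    within Euclidean distance eps of some selected row."""
    selected = []
    # Distance from each action to its nearest selected action so far.
    nearest = np.full(len(features), np.inf)
    while nearest.max() > eps:
        idx = int(nearest.argmax())  # farthest not-yet-covered action
        selected.append(idx)
        dist = np.linalg.norm(features - features[idx], axis=1)
        nearest = np.minimum(nearest, dist)
    return selected

rng = np.random.default_rng(0)

# Assumption: 2000 actions with unit-norm features in R^3.
features = rng.normal(size=(2000, 3))
features /= np.linalg.norm(features, axis=1, keepdims=True)

eps = 0.5
net = greedy_epsilon_net(features, eps)

# Assumption: linear rewards, reward(theta, a) = <theta, features[a]>,
# with unit-norm environment parameters theta.
thetas = rng.normal(size=(100, 3))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)

full_best = (thetas @ features.T).max(axis=1)      # best over all actions
net_best = (thetas @ features[net].T).max(axis=1)  # best over the subset
worst_gap = float((full_best - net_best).max())

print(f"subset size: {len(net)} of {len(features)}")
print(f"worst-case reward gap: {worst_gap:.3f} (eps = {eps})")
```

Under these assumptions, the Cauchy-Schwarz inequality bounds the per-environment gap by `eps`, since the optimal action always has a net member within distance `eps` in feature space. The paper's guarantees concern a more general, non-centered sub-Gaussian model, which this sketch does not attempt to reproduce.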
