Representative Action Selection for Large Action Space Bandit Families

Quan Zhou, Mark Kozdoba, Shie Mannor

TL;DR

The paper addresses the problem of efficiently learning across a family of bandits with a large shared action space by extracting a small, representative subset of actions. It introduces a simple sampling-based subset selection algorithm that builds a representative set without requiring explicit correlation knowledge, and provides regret bounds framed by $\epsilon$-nets and partitions of the action space. The analysis leverages Gaussian process (and RKHS) modeling of rewards and develops both geometric and measure-theoretic net concepts to bound regret, with the key sampling-correction term decaying exponentially in the sample budget $K$. Empirically, the method outperforms standard baselines such as Thompson Sampling and CUCB, demonstrates robustness to varying correlation structures, and remains scalable by avoiding exhaustive inner optimization over the full action space. The work offers a practical route to reducing exploration and computation in large action-space bandits while retaining performance across a family of related tasks.

Abstract

We study the problem of selecting a subset from a large action space shared by a family of bandits, with the goal of achieving performance nearly matching that of using the full action space. Indeed, in many natural situations, while the nominal set of actions may be large, there also exist significant correlations between the rewards of different actions. In this paper we propose an algorithm that can significantly reduce the action space when such correlations are present, without the need to a-priori know the correlation structure. We provide theoretical guarantees on the performance of the algorithm and demonstrate its practical effectiveness through empirical comparisons with Thompson Sampling and Upper Confidence Bound methods.
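The sampling-based selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes linear rewards $\langle\theta, a\rangle$, bandits drawn as $\theta \sim \mathcal{N}(0, I)$, and actions on the unit sphere in $\mathbb{R}^3$ (the setup reported for Figure 4). The function names `select_representatives` and `expected_regret` are hypothetical.

```python
import numpy as np

def select_representatives(actions, K, rng):
    """Sample K bandits and keep each one's best action (duplicates merged)."""
    d = actions.shape[1]
    # Assumed bandit family: theta ~ N(0, I), reward of action a is <theta, a>.
    thetas = rng.standard_normal((K, d))
    # Argmax oracle: best action in the full space for each sampled bandit.
    best = np.argmax(thetas @ actions.T, axis=1)
    return np.unique(best)

def expected_regret(actions, subset, n_eval, rng):
    """Monte Carlo estimate of E[max over full space - max over subset]."""
    d = actions.shape[1]
    thetas = rng.standard_normal((n_eval, d))
    rewards = thetas @ actions.T
    full_best = rewards.max(axis=1)
    subset_best = rewards[:, subset].max(axis=1)
    return float(np.mean(full_best - subset_best))

rng = np.random.default_rng(0)
raw = rng.standard_normal((1000, 3))
actions = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # points on the unit sphere

subset = select_representatives(actions, K=10, rng=rng)
regret = expected_regret(actions, subset, n_eval=2000, rng=rng)
print(f"{len(subset)} representative actions, estimated regret {regret:.3f}")
```

The subset can contain fewer than $K$ actions, because several sampled bandits may share the same best action. Strong correlations between rewards make such collisions more likely, which is the situation in which a small subset is expected to suffice.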

Paper Structure

This paper contains 37 sections, 16 theorems, 106 equations, 5 figures, 1 algorithm.

Key Result

Lemma 3.3

Fix $\epsilon>0$. Let $\mathcal{R}$ be a partition of $\mathcal{A}_{\mathrm{full}}$, and let $q$ be an importance measure given by \ref{equ:q-measure}. Let $\mathcal{A}$ be the output of Algorithm \ref{alg:smallvertices} after $K$ samples. Then, with probability at least $1-\frac{1}{\epsilon} \e

Figures (5)

  • Figure 1: Illustration of Theorem \ref{thm:alg-bound-upper}. The full action space (blue dots) lies on the unit sphere in $\mathbb{R}^3$ and can be partitioned into three clusters. The regret bound established in the theorem depends not on the diameter of the entire action space (red dashed line) but on the diameters of the individual clusters (green dashed lines), each of which is significantly smaller.
  • Figure 2: Comparison of methods for solving \ref{equ:opt} by selecting $K=5$ actions from 15 grid points in $[0,2]$, with outcome functions $f(a)=\mu_a$ sampled via an RBF kernel at varying length-scales. Left: expected regret over 50 repetitions for our method with exhaustive search (green) vs. super-arm TS (yellow), UCB (orange), and Successive Halving (SH; blue). TS/UCB run for 3,000 rounds; SH uses a budget of 37,000 pulls. Right: SH's expected regret (blue) and cumulative super-arm pulls (yellow bars) per round vs. our method (green). SH requires nearly 10,000 pulls (about three times the number of super-arms) to match our performance.
  • Figure 3: Comparison of EpsilonNet+TS (green), CTS (brown), and CUCB (red) on solving \ref{equ:opt} with $K=10$ actions from 500 grid points in $[-5,5]$, using outcome functions $f(a)=\mu_a$ sampled via an RBF kernel at varying length-scales. In EpsilonNet+TS, the argmax oracle in Algorithm \ref{alg:smallvertices} is replaced by TS; CTS and CUCB run for 3,000 rounds. All methods use a $\mathcal{N}(0,1)$ prior. Left: expected regret (30 repetitions). Right: runtime (purple, left axis) and arm pulls (gray, right axis).
  • Figure 4: Illustration of clustered action spaces on the unit sphere in $\mathbb{R}^3$ and the effect of cluster diameters on regret. Five clusters are formed by generating 5 fixed center points, with 200 points sampled around each using Gaussian noise (the spread parameter controls the variance). Bandits are drawn from $\mathcal{N}(0, I)$. The left subplot shows the mean $\pm$ standard deviation of the expected regret (over 30 trials) as the spread varies from 0.01 to 0.5, using $10^4$ additional bandits. The middle and right subplots show example action spaces (blue dots) for spread values 0.01 and 0.5, with representative actions (purple stars) selected by Algorithm \ref{alg:smallvertices} with $K = 10$.
  • Figure 5: Experiments with outcome functions sampled from the RBF/Gibbs kernels in \ref{equ:kernel-define-app}. Outcome functions sampled from the Gibbs kernel over a fixed grid of 1000 points in $[0,2]$ (blue curves, right y-axis). The histogram (purple bars, left y-axis) shows action selection frequencies by Algorithm \ref{alg:smallvertices} with $K=5000$, favoring regions with rougher functions and edge points.
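
The grid experiments in Figures 2 and 3 draw outcome functions from a Gaussian process with an RBF kernel. A minimal sketch of that sampling step is given below, assuming the standard squared-exponential form $k(x,y)=\exp(-(x-y)^2/(2\ell^2))$; the paper's exact kernel definition is not reproduced here, and the length-scale value, the jitter term, and the function names are illustrative assumptions. The last lines count the super-arms in the Figure 2 setting.

```python
import math
import numpy as np

def rbf_kernel(x, length_scale):
    """Squared-exponential kernel matrix on a 1-D grid (assumed form)."""
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def sample_outcome_functions(x, length_scale, n_samples, rng, jitter=1e-6):
    """Draw zero-mean GP samples f(a) = mu_a on the grid x via Cholesky."""
    # The jitter on the diagonal is an assumption added for numerical stability.
    cov = rbf_kernel(x, length_scale) + jitter * np.eye(len(x))
    chol = np.linalg.cholesky(cov)
    z = rng.standard_normal((len(x), n_samples))
    return (chol @ z).T  # one sampled function per row

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 2.0, 15)  # 15 grid points in [0, 2], as in Figure 2
samples = sample_outcome_functions(grid, length_scale=0.5, n_samples=3, rng=rng)

# Number of size-5 subsets (super-arms) of 15 actions.
n_super_arms = math.comb(15, 5)
print(samples.shape, n_super_arms)
```

With 15 grid points and $K=5$ there are 3,003 super-arms, so the roughly 10,000 pulls reported for SH in Figure 2 correspond to about three pulls per super-arm.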

Theorems & Definitions (37)

  • Example 2.1
  • Definition 3.1
  • Definition 3.2
  • Lemma 3.3
  • Definition 4.1: $\epsilon$-reference sets
  • Theorem 4.2: Regret bounds of $\epsilon$-reference subsets
  • Theorem 4.4
  • Theorem 4.5
  • Remark 4.6: The Cluster Complexity Term
  • Remark 4.7: Sampling Correction Term
  • ...and 27 more