Combinatorial Rising Bandit

Seockbean Song, Youngsik Yoon, Siwei Wang, Wei Chen, Jungseul Ok

TL;DR

CRB models combinatorial online learning in which pulling a base arm increases the future rewards of every super arm that includes it. CRUCB combines a Future-UCB index for each base arm with a solver that picks the best super arm, and its regret upper bound is shown to be nearly tight by a matching lower bound for CRB. The theoretical results characterize the optimality landscape: constant policies may be near-optimal under additive reward, while general CRB requires more nuanced strategies. Experiments on synthetic shortest-path tasks and AntMaze deep RL show that CRUCB outperforms baselines, which supports its practical relevance for structured decision problems with rising rewards and overlapping actions.
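The "per-base-arm index plus solver" structure can be illustrated with a minimal sketch. The index below is a simplified stand-in, not the paper's Future-UCB formula: it takes the mean of the most recent outcomes and extrapolates the recent slope forward, and it omits the confidence term. The names `optimistic_index`, `select_super_arm`, the `window` parameter, and the example histories are all hypothetical, and the solver is a brute-force enumeration that assumes an additive reward.

```python
from typing import Dict, List, Sequence, Tuple


def optimistic_index(history: Sequence[float], t: int, window: int = 2) -> float:
    """Simplified future-looking index (an assumption, NOT the paper's formula).

    Uses the mean of the last `window` outcomes and extrapolates the recent
    per-pull slope to the value the arm might reach after t pulls.
    """
    n = len(history)
    if n < 2 * window:
        return float("inf")  # too few samples: force exploration
    recent = sum(history[n - window:]) / window
    older = sum(history[n - 2 * window:n - window]) / window
    # The two window centers are `window` pulls apart, so this is rise per pull.
    slope = max(recent - older, 0.0) / window
    return recent + slope * (t - n)


def select_super_arm(
    histories: Dict[str, List[float]],
    super_arms: List[Tuple[str, ...]],
    t: int,
) -> Tuple[str, ...]:
    """Brute-force 'solver': pick the super arm with the largest summed index."""
    index = {arm: optimistic_index(h, t) for arm, h in histories.items()}
    return max(super_arms, key=lambda arms: sum(index[a] for a in arms))


# Hypothetical outcome histories for three base arms (edges).
histories = {
    "shared": [0.1, 0.2, 0.3, 0.4],
    "early": [0.8, 0.85, 0.87, 0.88],  # starts high, flattens
    "late": [0.1, 0.3, 0.5, 0.7],      # starts low, rises quickly
}
super_arms = [("shared", "early"), ("shared", "late")]

choice_now = select_super_arm(histories, super_arms, t=4)     # no look-ahead
choice_later = select_super_arm(histories, super_arms, t=10)  # looks ahead
print(choice_now, choice_later)
```

With no look-ahead (`t=4`) the early-peaker path has the larger summed index, while projecting further ahead (`t=10`) favors the late-bloomer path, which is the qualitative behavior the TL;DR attributes to a future-aware index.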

Abstract

Combinatorial online learning is a fundamental task for selecting the optimal action (or super arm) as a combination of base arms in sequential interactions with systems providing stochastic rewards. It is applicable to diverse domains such as robotics, social advertising, network routing, and recommendation systems. Many real-world scenarios exhibit rising rewards, where playing a base arm not only provides an instantaneous reward but also enhances future rewards, e.g., robots improving their proficiency through practice and social influence strengthening through a history of successful recommendations. Moreover, the enhancement of a single base arm may affect multiple super arms that include it, introducing complex dependencies that are not captured by existing rising bandit models. To address this, we introduce the Combinatorial Rising Bandit (CRB) framework and propose a provably efficient algorithm, Combinatorial Rising Upper Confidence Bound (CRUCB). We establish an upper bound on the regret of CRUCB and show that it is nearly tight by deriving a matching lower bound. In addition, we empirically demonstrate the effectiveness of CRUCB not only in synthetic environments but also in realistic applications of deep reinforcement learning.
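The dependency described above, where improving one base arm changes the reward of every super arm that contains it, can be made concrete with a small sketch. It assumes an additive reward and noise-free outcomes for readability (the framework itself has stochastic rewards), and the class `RisingArm`, the function `play`, and the outcome functions are hypothetical choices for illustration.

```python
from typing import Callable, Dict, Tuple


class RisingArm:
    """Base arm whose outcome depends on how often it has been pulled."""

    def __init__(self, mu: Callable[[int], float]) -> None:
        self.mu = mu    # maps pull count n to the (noise-free) outcome
        self.pulls = 0

    def pull(self) -> float:
        outcome = self.mu(self.pulls)
        self.pulls += 1
        return outcome


def play(super_arm: Tuple[str, ...], arms: Dict[str, RisingArm]) -> float:
    """Additive reward: sum of the outcomes of the base arms in the super arm."""
    return sum(arms[name].pull() for name in super_arm)


def make_arms() -> Dict[str, RisingArm]:
    # Hypothetical non-decreasing outcome functions.
    return {
        "shared": RisingArm(lambda n: 1.0 - 1.0 / (n + 1)),
        "early": RisingArm(lambda n: 0.8),
        "late": RisingArm(lambda n: min(1.0, 0.2 + 0.1 * n)),
    }


early_path = ("shared", "early")
late_path = ("shared", "late")

# Case 1: play the late path on a fresh instance.
reward_fresh = play(late_path, make_arms())

# Case 2: play the early path three times first, then the late path.
arms = make_arms()
for _ in range(3):
    play(early_path, arms)
reward_after = play(late_path, arms)

print(reward_fresh, reward_after)
```

The late path earns more in the second case even though it was never played before, because the three earlier plays of the other path already raised the shared edge.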

Paper Structure

This paper contains 48 sections, 12 theorems, 81 equations, 16 figures, 2 tables, 6 algorithms.

Key Result

Theorem 1

Under Assumptions asu:rising and asu:monotone, there exists an instance of CRB in which $\pi^*_{\text{const}}$ is not optimal.

Figures (16)

  • Figure 1: Toy example for online shortest path planning. (a) Graph: two paths from $s$ to $g$, an early-peaker path ({shared edge, early peaker}) and a late-bloomer path ({shared edge, late bloomer}). (b) Outcome functions: the shared edge rises slowly; the early peaker starts high but flattens; the late bloomer starts low but rises quickly, eventually surpassing the early peaker, so the late-bloomer path is optimal for a long horizon $T$. The reward is the sum of the outcomes of the base arms. (c) Cumulative regret under three algorithms: CRUCB (ours); SW-CUCB [chen2021combinatorial] (combinatorial bandits); R-ed-UCB [metelli2022stochastic] (rested rising bandits). The regret of CRUCB becomes nearly flat, while SW-CUCB and R-ed-UCB accumulate linear regret. (d) Empirical number of pulls of each edge: CRUCB concentrates its pulls on the late bloomer, SW-CUCB on the early peaker, and R-ed-UCB splits its pulls roughly evenly.
  • Figure 2: Growth of outcomes. $\mu_i(n)$ induced by $\gamma_i(n)\!=\!Cf(n)$, with $C$ as a normalizing constant.
  • Figure 3: Regret bound gap. The regret lower bound of CRB and the regret upper bound of CRUCB when $f(n)\!=\!(n\!+\!1)^{-c}$. For $c \le 1$, both the upper and lower bounds are equal to $1$. For $1\!<\!c\!<\!1.5$, the lower bound ($2\!-\!c$) and the upper bound ($\frac{1}{c}$) are of similar order, indicating that the regret bounds closely match.
  • Figure 4: Online shortest path planning task. (a, c) Graphs used to evaluate CRUCB and baselines. (b, d) Corresponding outcome functions for each task.
  • Figure 5: Cumulative regret in synthetic environments. Regret curves for (a) Path-easy and (b) Path-complex. Lines show average; shaded areas indicate 99% confidence intervals over 5 runs.
  • ...and 11 more figures
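The gap described in the Figure 3 caption can be checked numerically. The sketch below uses only the values stated in the caption ($1$ for $c \le 1$, and $2-c$ versus $\frac{1}{c}$ for $1<c<1.5$); reading them as exponents of the horizon $T$ is an assumption, and the function names are hypothetical. Values of $c \ge 1.5$ are rejected because the caption does not state the bounds there. For $1<c<1.5$ the difference simplifies to $\frac{1}{c}-(2-c)=\frac{(c-1)^2}{c}$, which is small when $c$ is close to $1$.

```python
def lower_bound_value(c: float) -> float:
    """Lower-bound value from the Figure 3 caption (assumed to be an exponent of T)."""
    if c <= 1.0:
        return 1.0
    if c < 1.5:
        return 2.0 - c
    raise ValueError("the caption gives no value for c >= 1.5")


def upper_bound_value(c: float) -> float:
    """Upper-bound value from the Figure 3 caption (assumed to be an exponent of T)."""
    if c <= 1.0:
        return 1.0
    if c < 1.5:
        return 1.0 / c
    raise ValueError("the caption gives no value for c >= 1.5")


def gap(c: float) -> float:
    """Upper value minus lower value; equals (c - 1)^2 / c for 1 < c < 1.5."""
    return upper_bound_value(c) - lower_bound_value(c)


for c in (0.5, 1.0, 1.1, 1.25, 1.4):
    print(f"c={c:<4} lower={lower_bound_value(c):.3f} "
          f"upper={upper_bound_value(c):.3f} gap={gap(c):.3f}")
```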

Theorems & Definitions (21)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • Corollary 1
  • Theorem 3
  • Corollary 2
  • Theorem 4
  • Theorem 5
  • Lemma 1
  • proof
  • ...and 11 more