Combinatorial Rising Bandit
Seockbean Song, Youngsik Yoon, Siwei Wang, Wei Chen, Jungseul Ok
TL;DR
CRB models combinatorial online learning where pulling base arms increases future rewards for all super arms that include them. CRUCB combines a Future-UCB per-base-arm index with a solver to pick the best super arm, and is shown to achieve regret bounds that are near-tight by matching lower bounds for CRB. Theoretical results characterize the optimality landscape, showing constant policies may be near-optimal under additive reward, while general CRB requires more nuanced strategies. Empirical results across synthetic shortest-path tasks and AntMaze deep RL demonstrate that CRUCB outperforms baselines, highlighting its practical relevance for rising-reward, structured decision problems with overlapping actions.
Abstract
Combinatorial online learning is a fundamental task for selecting the optimal action (or super arm) as a combination of base arms in sequential interactions with systems providing stochastic rewards. It is applicable to diverse domains such as robotics, social advertising, network routing, and recommendation systems. In many real-world scenarios, we often encounter rising rewards, where playing a base arm not only provides an instantaneous reward but also contributes to the enhancement of future rewards, e.g., robots enhancing proficiency through practice and social influence strengthening in the history of successful recommendations. Moreover, the enhancement of a single base arm may affect multiple super arms that include it, introducing complex dependencies that are not captured by existing rising bandit models. To address this, we introduce the Combinatorial Rising Bandit (CRB) framework and propose a provably efficient algorithm, Combinatorial Rising Upper Confidence Bound (CRUCB). We establish an upper bound on regret CRUCB and show that it is nearly tight by deriving a matching lower bound. In addition, we empirically demonstrate the effectiveness of CRUCB not only in synthetic environments but also in realistic applications of deep reinforcement learning.
