Rising Multi-Armed Bandits with Known Horizons

Seockbean Song, Chenyu Gan, Youngsik Yoon, Siwei Wang, Wei Chen, Jungseul Ok

TL;DR

This work studies Rising Multi-Armed Bandits under known finite horizons, where optimal strategies depend on the remaining budget $T$. It introduces CURE-UCB, a horizon-aware UCB-style algorithm that estimates the cumulative reward an arm can yield over the rest of the horizon via a horizon-adaptive index $B_i(t)$, which combines a recent mean, a projected future gain, and an exploration bonus. The authors prove a strict dominance of CURE-UCB over horizon-agnostic baselines in Linear-Then-Flat settings and derive a general regret bound for concave rising environments via a cumulative increment measure, with extensive experiments showing practical gains on synthetic tasks and online model selection (IMDB). The results highlight the importance of horizon awareness for efficient decision-making in finite-horizon RMABs and point to broad applicability in hyperparameter tuning and robotics.

Abstract

The Rising Multi-Armed Bandit (RMAB) framework models environments where the expected reward of each arm increases with the number of times it is played. This captures practical scenarios in which the performance of each option improves with repeated use, such as robotics and hyperparameter tuning. For instance, in hyperparameter tuning, the validation accuracy of a model configuration (arm) typically increases with each training epoch. A defining characteristic of RMAB is horizon-dependent optimality: unlike standard settings, the optimal strategy here shifts dramatically depending on the available budget $T$. This implies that knowledge of $T$ yields significantly greater utility in RMAB, empowering the learner to align its decision-making with this shifting optimality. However, the horizon-aware setting remains underexplored. To address this, we propose a novel CUmulative Reward Estimation UCB (CURE-UCB) that explicitly integrates the horizon. We provide a rigorous analysis establishing a new regret upper bound and prove that our method strictly outperforms horizon-agnostic strategies in structured environments like "linear-then-flat" instances. Extensive experiments demonstrate its significant superiority over baselines.
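
The TL;DR describes the index $B_i(t)$ as combining a recent mean, a projected future gain, and an exploration bonus, but this page does not give its formula. The following is only a toy sketch of how three such terms could be added together for one arm. The function name `toy_horizon_index`, the window-difference slope estimate, the halving of the remaining rounds, and the bonus constant are all assumptions made for illustration, not the paper's definition of CURE-UCB.

```python
import math


def toy_horizon_index(rewards, t, T, window, sigma=1.0):
    """Toy horizon-aware index for one arm (illustrative, NOT the paper's B_i(t)).

    rewards: observed rewards of this arm, oldest first
    t:       current round, T: known horizon
    window:  number of recent pulls used for the estimates (assumed parameter)
    sigma:   scale of the exploration bonus (assumed parameter)
    """
    n = len(rewards)
    if n < 2 * window:
        # Not enough data for a slope estimate: force the arm to be tried.
        return math.inf

    recent = rewards[n - window:]
    older = rewards[n - 2 * window:n - window]
    recent_mean = sum(recent) / window

    # Per-pull growth estimated from two adjacent windows (assumption).
    slope = (recent_mean - sum(older) / window) / window

    # Average extra reward if the growth continued linearly over the
    # remaining rounds; the gain shrinks to zero as t approaches T.
    remaining = max(T - t, 0)
    projected_gain = max(slope, 0.0) * remaining / 2.0

    # Generic UCB-style bonus (assumption, not the paper's constant).
    bonus = sigma * math.sqrt(2.0 * math.log(max(t, 2)) / window)

    return recent_mean + projected_gain + bonus
```

With the same observations, this toy index is larger when more rounds remain, which is the qualitative behavior the abstract attributes to horizon awareness: growth potential counts for more when there is still time to collect on it.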

Paper Structure (29 sections, 5 theorems, 46 equations, 11 figures, 3 algorithms)

This paper contains 29 sections, 5 theorems, 46 equations, 11 figures, 3 algorithms.

Key Result

Proposition 3.2

For a rising bandit problem with a finite horizon $T$, the optimal policy $\pi^*$ consists of selecting a single arm $i^* \in [K]$ and playing it in every round $t \in [T]$. This arm $i^*$ is the one that maximizes the cumulative reward over the horizon $T$.
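
To make the proposition concrete: writing $\mu_i(n)$ for the expected reward of arm $i$ on its $n$-th pull (notation assumed here, since this page does not show the paper's equation), the statement amounts to $i^* \in \arg\max_{i \in [K]} \sum_{n=1}^{T} \mu_i(n)$. The sketch below evaluates that sum for two hypothetical linear-then-flat arms. The arm names, slopes, and caps are made up for illustration and are not taken from the paper's experiments.

```python
def ltf_reward(n, slope, cap):
    """Expected reward on the n-th pull (n >= 1): linear growth, then flat at cap."""
    return min(slope * n, cap)


def best_single_arm(arms, T):
    """Return the arm with the largest cumulative expected reward over T pulls.

    arms: dict mapping arm name -> (slope, cap); values are hypothetical.
    """
    totals = {
        name: sum(ltf_reward(n, slope, cap) for n in range(1, T + 1))
        for name, (slope, cap) in arms.items()
    }
    best = max(totals, key=totals.get)
    return best, totals


# Hypothetical instance: one arm saturates early at a low level,
# the other grows slowly but reaches a higher level.
arms = {
    "early_peaker": (0.05, 0.5),
    "late_bloomer": (0.001, 1.0),
}

for horizon in (100, 5000):
    best, totals = best_single_arm(arms, horizon)
    print(horizon, best, totals)
```

In this toy instance the early peaker is the best single arm for the short horizon and the late bloomer is the best for the long one, which is the horizon-dependent optimality the abstract refers to.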

Figures (11)

  • Figure 1: Demonstration of Horizon-Adaptiveness. (a) Expected reward functions. Arm A represents an Early Peaker (high initial reward, limited growth), and Arm B represents a Late Bloomer (low initial reward, high potential). (b) Cumulative regret results across varying time horizons $T$. CURE-UCB (Ours) consistently achieves small regret across all horizons. In contrast, baselines suffer structural failures. R-ed-UCB (horizon-agnostic rising bandit algorithm) incurs high regret in short horizons ($T$=10,000) whereas SW-UCB (non-stationary bandit algorithm) fails in long horizons ($T$=30,000).
  • Figure 2: Comparison of estimation behaviors: horizon-aware (CURE-UCB) vs. horizon-agnostic (R-ed-UCB). The Horizon Limit denotes the maximum reachable point given the remaining trials. Since this horizon ($T-t$) is identical for all arms, we visualize CURE-UCB using its midpoint derived from the cumulative area; normalizing by this common factor preserves the arm ranking. (a) Early Stage: R-ed-UCB remains conservative due to small $t$, whereas CURE-UCB targets a point closer to the horizon limit, enabling aggressive exploration. (b) Late Stage: CURE-UCB shifts to exploitation within the limit, while R-ed-UCB overshoots into the impossible region.
  • Figure 3: Example instances of synthesized environments. (a) The Linear-Then-Flat (LTF) setting, where the expected reward increases linearly before reaching a saturation point. (b) The concave setting, characterizing non-linear growth dynamics.
  • Figure 4: Performance Analysis in LTF Setting. (a) Cumulative regret as a function of the time horizon $T$. CURE-UCB consistently achieves the lowest regret across all horizons. (b, c) Average Rank at $T$=10,000 (short horizon) and $T$=50,000 (long horizon), respectively. Lower rank indicates better performance. CURE-UCB maintains the lowest rank in both horizons. Shaded regions and error bars denote 95% confidence intervals.
  • Figure 5: Performance Analysis in Concave Setting. (a) Cumulative regret as a function of the time horizon $T$. CURE-UCB consistently achieves the lowest regret across all horizons. (b, c) Average Rank at $T$=10,000 (short horizon) and $T$=50,000 (long horizon), respectively. Lower rank indicates better performance. CURE-UCB maintains the lowest rank in both horizons. Shaded regions and error bars denote 95% confidence intervals.
  • ...and 6 more figures

Theorems & Definitions (14)

  • Proposition 3.2: Structure of Optimal Policy
  • Definition 5.1
  • Theorem 5.2: Strict Dominance in deterministic LTF
  • Definition 5.3
  • Theorem 5.4: Regret upper bound for general case
  • Claim 1
  • Claim 2
  • Claim 3
  • Claim 4
  • proof
  • ...and 4 more