Near-Optimal Regret for Efficient Stochastic Combinatorial Semi-Bandits

Zichun Ye, Runqi Wang, Xutong Liu, Shuai Li

TL;DR

This work tackles stochastic combinatorial multi-armed bandits (CMAB) under semi-bandit and cascading feedback, addressing the tension between minimax-optimal regret and computational efficiency. The authors introduce CMOSS, a MOSS-inspired algorithm that uses optimistic per-arm estimates with a refined confidence radius to select feasible actions, thereby removing the $\log T$ factor that burdens CUCB-type methods. Under semi-bandit feedback, CMOSS achieves instance-independent regret bounds of $O\big( (\log k)\sqrt{kmT}\big)$ for $k\le \frac{m}{2}$ and $O\big((m-k)\sqrt{\log k\log(m-k)T}\big)$ for $k>\frac{m}{2}$; under cascading feedback, the bound picks up a multiplicative factor of $1/p^*$. Empirical results on synthetic and real-world data show that CMOSS consistently outperforms baselines in regret while maintaining competitive runtimes, supporting its practical potential for large-scale CMAB tasks. The authors also suggest future directions: tightening the dependence on observation probabilities and integrating newer UCB techniques.
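The selection rule described above (an optimistic index per arm, then a feasible action chosen from those indices) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the confidence radius is the classic single-arm MOSS radius $\sqrt{\max(\ln(T/(mn)), 0)/n}$ rather than the paper's refined radius, which this summary does not state; the feasible set is assumed to be every subset of exactly $k$ arms, so the oracle is a plain top-$k$ selection; rewards are assumed Bernoulli. All function names are hypothetical.

```python
import math
import random


def moss_index(mean, pulls, horizon, num_arms):
    """Optimistic index with the classic MOSS radius (an assumption here,
    not the refined radius used by CMOSS)."""
    if pulls == 0:
        return float("inf")  # unobserved arms are tried first
    log_term = max(math.log(horizon / (num_arms * pulls)), 0.0)
    return mean + math.sqrt(log_term / pulls)


def top_k_oracle(indices, k):
    """Oracle for the assumed 'any k arms' constraint: take the k largest indices."""
    order = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    return order[:k]


def run_moss_style_semi_bandit(true_means, k, horizon, seed=0):
    """Simulate the sketch on Bernoulli arms; return (pseudo-regret, pull counts)."""
    rng = random.Random(seed)
    m = len(true_means)
    pulls = [0] * m
    means = [0.0] * m
    best_value = sum(sorted(true_means, reverse=True)[:k])
    regret = 0.0
    for _ in range(horizon):
        indices = [moss_index(means[i], pulls[i], horizon, m) for i in range(m)]
        action = top_k_oracle(indices, k)
        # Semi-bandit feedback: every chosen arm reveals its own reward.
        for i in action:
            reward = 1.0 if rng.random() < true_means[i] else 0.0
            pulls[i] += 1
            means[i] += (reward - means[i]) / pulls[i]  # running average
        regret += best_value - sum(true_means[i] for i in action)
    return regret, pulls


if __name__ == "__main__":
    regret, pulls = run_moss_style_semi_bandit(
        [0.9, 0.8, 0.1, 0.1, 0.1, 0.1], k=2, horizon=2000
    )
    print("pseudo-regret:", regret)
    print("pull counts:", pulls)
```

The logarithm inside the radius depends on $T/(mn)$ and is clipped at zero, so an arm's exploration bonus vanishes once it has been observed often enough; this is the mechanism by which MOSS-type indices avoid a $\log T$ factor, in contrast to a UCB radius that keeps growing with $\ln t$.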

Abstract

The combinatorial multi-armed bandit (CMAB) is a cornerstone framework of sequential decision-making, dominated by two algorithmic families: UCB-based methods and adversarial methods such as follow the regularized leader (FTRL) and online mirror descent (OMD). However, prominent UCB-based approaches like CUCB suffer from an additional regret factor of $\log T$ that is detrimental over long horizons, while adversarial methods such as EXP3.M and HYBRID impose significant computational overhead. To resolve this trade-off, we introduce the Combinatorial Minimax Optimal Strategy in the Stochastic setting (CMOSS). CMOSS is a computationally efficient algorithm that achieves an instance-independent regret of $O\big( (\log k)\sqrt{kmT}\big )$ when $k\leq \frac{m}{2}$ and $O\big((m-k)\sqrt{\log k\log(m-k)T}\big )$ when $k>\frac{m}{2}$ under semi-bandit feedback, where $m$ is the number of arms and $k$ is the maximum cardinality of a feasible action. Crucially, this result eliminates the dependency on $\log T$ and matches the established lower bounds of $\Omega\big(\sqrt{kmT}\big)$ when $k\leq \frac{m}{2}$ and $\Omega\big((m-k)\sqrt{\log (\frac{m}{m-k}) T}\big)$ when $k>\frac{m}{2}$, up to logarithmic factors in $k$ and $m$. We then extend our analysis to show that CMOSS is also applicable to cascading feedback. Experiments on synthetic and real-world datasets validate that CMOSS consistently outperforms benchmark algorithms in both regret and runtime efficiency.

Paper Structure

This paper contains 30 sections, 12 theorems, 63 equations, 5 figures, 11 tables, 5 algorithms.

Key Result

Theorem 1

Consider any stochastic combinatorial semi-bandit instance with $m$ arms and time horizon $T$, in which at most $k$ arms can be chosen in a round. When $k\leq \frac{m}{2}$, the expected regret of CMOSS is $O\left((\log k)\sqrt{kmT}\right)$, and when $k > \frac{m}{2}$, the expected regret is $O\left((m-k)\sqrt{\log k\log(m-k)T}\right)$.

Figures (5)

  • Figure 1: Comparison of CMOSS (blue) with three baseline algorithms under semi-bandit feedback. Subplots (1)-(4) and (9)-(12) use the synthetic dataset; (5)-(8) use the Yelp dataset. Subplots (1)-(8) show cumulative regret with $k=10, m=30$ and $k=20, m=30$; (9)(10) show ablation studies varying $k$ (fixed $m=30$), while (11)(12) vary $m$ (fixed $k=15$). Initial means of base arms fall within the range $[0, 0.1]$, except for (2)(4)(6)(8), which use the range $[0.3,0.4]$.
  • Figure 2: The plot of $f(x) = \frac{\ln\left(\frac{k^2 x^2}{\delta}\right)}{x^2}$.
  • Figure 3: Ablation study of EXP3.M with varying mixing coefficient $\gamma \in \{0.001, 0.01, 0.1\}$.
  • Figure 4: Comparison of CMOSS (blue) with CUCB algorithms under cascading (descending) feedback. Subplots (1)(2)(5)(6) show cumulative regret with fixed $k=10$, $m=30$; (3)(4) show ablation studies varying $k$ (fixed $m=30$), while (7)(8) vary $m$ (fixed $k=15$). Subplots (1)(2)(3)(4)(7)(8) use the synthetic dataset; (5)(6) use the Yelp dataset. Initial means of base arms fall within the range $[0, 0.1]$, except for (2)(6), which use the range $[0.3,0.4]$.
  • Figure 5: Comparison of CMOSS (blue) with CUCB algorithms under cascading (ascending) feedback. Subplots (1)(2)(5)(6) show cumulative regret with fixed $k=10$, $m=30$; (3)(4) show ablation studies varying $k$ (fixed $m=30$), while (7)(8) vary $m$ (fixed $k=15$). Subplots (1)(2)(3)(4)(7)(8) use the synthetic dataset; (5)(6) use the Yelp dataset. Initial means of base arms fall within the range $[0, 0.1]$, except for (2)(6), which use the range $[0.3,0.4]$.
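
The shape of the function plotted in Figure 2 can be worked out with elementary calculus. The derivation below is only a sketch of that calculation for $x > 0$, $k > 0$, $\delta > 0$; how the paper's analysis uses $f$ is not stated in this summary, so nothing here should be read as the authors' argument.

```latex
f(x) = \frac{\ln\!\left(\frac{k^2 x^2}{\delta}\right)}{x^2}, \qquad x > 0.

% Quotient rule, using \frac{d}{dx}\ln\!\left(\frac{k^2 x^2}{\delta}\right) = \frac{2}{x}:
f'(x) = \frac{\frac{2}{x}\cdot x^2 - 2x\,\ln\!\left(\frac{k^2 x^2}{\delta}\right)}{x^4}
      = \frac{2\left(1 - \ln\!\left(\frac{k^2 x^2}{\delta}\right)\right)}{x^3}.

% f'(x) = 0 exactly when \frac{k^2 x^2}{\delta} = e:
x^\ast = \frac{\sqrt{e\,\delta}}{k}, \qquad
f(x^\ast) = \frac{1}{(x^\ast)^2} = \frac{k^2}{e\,\delta}.
```

Since $f'(x) > 0$ for $x < x^\ast$ and $f'(x) < 0$ for $x > x^\ast$, the function rises to a single maximum of $\frac{k^2}{e\delta}$ and then decays toward zero; it is negative for $x < \frac{\sqrt{\delta}}{k}$, where the logarithm's argument is below 1.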

Theorems & Definitions (18)

  • Theorem 1
  • Lemma 2
  • Lemma 3: Indicator decomposition lemma
  • Lemma 4
  • Theorem 5
  • Lemma 6: Hoeffding's lemma, MU17
  • Lemma 7: Generalization of Hoeffding’s inequality, MU17
  • proof
  • Lemma 8: LS20
  • proof
  • ...and 8 more