Near-Optimal Regret for Efficient Stochastic Combinatorial Semi-Bandits
Zichun Ye, Runqi Wang, Xutong Liu, Shuai Li
TL;DR
This work tackles stochastic combinatorial multi-armed bandits (CMAB) under semi-bandit and cascading feedback, addressing the tension between minimax-optimal regret and computational efficiency. The authors introduce CMOSS, a MOSS-inspired algorithm that selects feasible actions using optimistic per-arm estimates with a refined confidence radius, thereby removing the $\log T$ factor that burdens CUCB-type methods. Under semi-bandit feedback, CMOSS achieves instance-independent regret bounds of $O\big((\log k)\sqrt{kmT}\big)$ for $k\le \frac{m}{2}$ and $O\big((m-k)\sqrt{\log k\log(m-k)T}\big)$ for $k>\frac{m}{2}$; the analysis extends to cascading feedback at the cost of a multiplicative factor of $1/p^*$. Experiments on synthetic and real-world data show that CMOSS consistently achieves lower regret than the baselines while maintaining competitive runtimes, supporting its practical potential for large-scale CMAB tasks. The authors also suggest future directions: tightening the dependence on observation probabilities and integrating newer UCB techniques.
Abstract
The combinatorial multi-armed bandit (CMAB) is a cornerstone framework for sequential decision-making, dominated by two algorithmic families: UCB-based methods and adversarial methods such as follow the regularized leader (FTRL) and online mirror descent (OMD). However, prominent UCB-based approaches like CUCB suffer from an additional $\log T$ regret factor that is detrimental over long horizons, while adversarial methods such as EXP3.M and HYBRID impose significant computational overhead. To resolve this trade-off, we introduce the Combinatorial Minimax Optimal Strategy in the Stochastic setting (CMOSS). CMOSS is a computationally efficient algorithm that achieves an instance-independent regret of $O\big((\log k)\sqrt{kmT}\big)$ when $k\leq \frac{m}{2}$ and $O\big((m-k)\sqrt{\log k\log(m-k)T}\big)$ when $k>\frac{m}{2}$ under semi-bandit feedback, where $m$ is the number of arms and $k$ is the maximum cardinality of a feasible action. Crucially, this result eliminates the dependency on $\log T$ and matches the established lower bounds of $\Omega\big(\sqrt{kmT}\big)$ when $k\leq \frac{m}{2}$ and $\Omega\big((m-k)\sqrt{\log (\frac{m}{m-k}) T}\big)$ when $k>\frac{m}{2}$, up to logarithmic factors in $k$ and $m$. We then extend our analysis to show that CMOSS is also applicable to cascading feedback. Experiments on synthetic and real-world datasets validate that CMOSS consistently outperforms benchmark algorithms in both regret and runtime efficiency.
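To make the idea of "optimistic per-arm estimates fed to a combinatorial oracle" concrete, here is a minimal Python sketch under stated assumptions. It is not the CMOSS algorithm: the abstract does not give the refined confidence radius, so the sketch substitutes the classical MOSS radius $\sqrt{\log^+(T/(mn))/n}$ for an arm pulled $n$ times. It also assumes Bernoulli rewards, a linear reward (the sum of the chosen arms' means), and the simplest action set (any $k$ arms), for which a top-$k$ selection is an exact oracle. The function names `moss_index`, `top_k_oracle`, and `run_moss_top_k` are hypothetical.

```python
import math
import random


def moss_index(mean, pulls, horizon, num_arms):
    """Classical MOSS index; a stand-in, NOT the paper's refined radius."""
    if pulls == 0:
        return math.inf  # unexplored arms are maximally optimistic
    # log^+ truncation: the bonus vanishes once pulls >= horizon / num_arms
    log_plus = max(0.0, math.log(horizon / (num_arms * pulls)))
    return mean + math.sqrt(log_plus / pulls)


def top_k_oracle(indices, k):
    """Exact oracle for the assumed 'any k arms' action set."""
    order = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    return order[:k]


def run_moss_top_k(true_means, k, horizon, seed=0):
    """Simulate semi-bandit feedback with Bernoulli arms (an assumption)."""
    rng = random.Random(seed)
    m = len(true_means)
    pulls = [0] * m
    means = [0.0] * m
    best = sum(sorted(true_means, reverse=True)[:k])
    regret = 0.0
    for _ in range(horizon):
        indices = [moss_index(means[i], pulls[i], horizon, m) for i in range(m)]
        action = top_k_oracle(indices, k)
        for i in action:
            # Semi-bandit feedback: every chosen arm reveals its own reward.
            reward = 1.0 if rng.random() < true_means[i] else 0.0
            pulls[i] += 1
            means[i] += (reward - means[i]) / pulls[i]  # running average
        regret += best - sum(true_means[i] for i in action)
    return regret, pulls


if __name__ == "__main__":
    regret, pulls = run_moss_top_k([0.9, 0.8, 0.2, 0.1], k=2, horizon=2000)
    print("cumulative pseudo-regret:", round(regret, 2))
    print("pulls per arm:", pulls)
```

The point of the sketch is the structure: the exploration bonus depends on $T/(mn)$ rather than on $\log T$ alone, and the per-round cost is one index computation per arm plus one oracle call. The actual regret guarantees quoted above rely on the paper's own radius and analysis, which this stand-in does not reproduce.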
