Continuous K-Max Bandits
Yu Chen, Siwei Wang, Longbo Huang, Wei Chen
TL;DR
This work addresses continuous $K$-Max bandits with value-index feedback, where only the winner's value and identity are observed per round. It introduces DCK-UCB, a computationally efficient algorithm that couples adaptive discretization with bias-corrected confidence bounds and an offline PTAS oracle to obtain sublinear regret $\widetilde{\mathcal{O}}(T^{3/4})$ for general continuous distributions. For the special case of exponential outcomes under full-bandit feedback, an MLE-based algorithm (MLE-Exp) attains a tighter $\widetilde{\mathcal{O}}(\sqrt{T})$ regret by leveraging the minimum-of-exponentials property. Together, these results advance the theory and practice of learning under partial, biased value-index feedback in bandit problems with continuous outcomes, with practical implications for recommendation, distributed computing, and server scheduling.
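The minimum-of-exponentials property mentioned above is a standard fact about independent exponential random variables. As a sketch, assume each selected arm $i \in S$ has outcome $X_i \sim \mathrm{Exp}(\lambda_i)$; the rate notation $\lambda_i$, the set $S_t$, the observation $y_t$, and the sum $\Lambda_t$ are illustrative symbols introduced here, not taken from the paper. If the quantity observed in round $t$ is the minimum over the selected arms, a generic log-likelihood follows directly. It is not necessarily the exact estimator used by MLE-Exp.

```latex
% Minimum of independent exponentials is exponential with the summed rate.
\Pr\Big[\min_{i \in S} X_i > x\Big]
  = \prod_{i \in S} e^{-\lambda_i x}
  = e^{-x \sum_{i \in S} \lambda_i}
\quad\Longrightarrow\quad
\min_{i \in S} X_i \sim \mathrm{Exp}\Big(\sum_{i \in S} \lambda_i\Big).

% Log-likelihood of observations y_1, ..., y_n, where y_t is the minimum
% over the selected set S_t and \Lambda_t = \sum_{i \in S_t} \lambda_i.
\ell(\lambda) = \sum_{t=1}^{n} \big( \log \Lambda_t - \Lambda_t \, y_t \big).
```

Because each observation depends on the rates only through the sum $\Lambda_t$, a single scalar per round already carries information about every selected arm, which is why no winner index is needed in this likelihood.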
Abstract
We study the $K$-Max combinatorial multi-armed bandit problem with continuous outcome distributions and weak value-index feedback: each base arm has an unknown continuous outcome distribution, and in each round the learning agent selects $K$ arms, receives the maximum value sampled from these $K$ arms as its reward, and observes this reward together with the corresponding arm index as feedback. This setting captures critical applications such as recommendation systems, distributed computing, and server scheduling. Continuous $K$-Max bandits introduce unique challenges, including discretization error from continuous-to-discrete conversion, non-deterministic tie-breaking under limited feedback, and biased estimation due to partial observability. Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds to tackle these challenges. For general continuous distributions, we prove that DCK-UCB achieves a $\widetilde{\mathcal{O}}(T^{3/4})$ regret upper bound, establishing the first sublinear regret guarantee for this setting. Furthermore, we identify an important special case with exponential distributions under full-bandit feedback. In this case, our proposed algorithm MLE-Exp achieves a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret upper bound through maximum likelihood estimation, attaining near-minimax optimality.
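The value-index feedback model and the estimation bias it causes can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings: the function name `k_max_round`, the two arm distributions, and the number of rounds are hypothetical choices, and the code does not implement DCK-UCB. It shows that averaging the values an arm produces when it wins overestimates that arm's true mean, since an arm's outcome is revealed only in rounds where it is the maximum.

```python
import random

def k_max_round(samplers, chosen, rng):
    """Draw one outcome per chosen arm; return (max value, index of the winning arm)."""
    outcomes = [samplers[i](rng) for i in chosen]
    pos = max(range(len(chosen)), key=lambda j: outcomes[j])
    return outcomes[pos], chosen[pos]

# Hypothetical arms with continuous outcomes on [0, 1].
samplers = [
    lambda rng: rng.uniform(0.0, 1.0),      # arm 0: true mean 0.5
    lambda rng: rng.betavariate(2.0, 5.0),  # arm 1: true mean 2/7
]

rng = random.Random(0)
wins_arm0 = []
for _ in range(20000):
    value, index = k_max_round(samplers, [0, 1], rng)
    if index == 0:
        # Arm 0's outcome is observed only when it is the maximum.
        wins_arm0.append(value)

# Naive average over observed wins: biased upward relative to the true mean 0.5.
naive_mean_arm0 = sum(wins_arm0) / len(wins_arm0)
print(f"naive mean of arm 0 from winning rounds: {naive_mean_arm0:.3f}")
```

The gap between this naive average and the true mean is the kind of bias that a correction to the confidence bounds has to account for.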
