Continuous K-Max Bandits

Yu Chen, Siwei Wang, Longbo Huang, Wei Chen

TL;DR

This work addresses continuous $K$-Max bandits with value-index feedback, where each round reveals only the winning arm's value and identity. It introduces DCK-UCB, a computationally efficient algorithm that couples adaptive discretization with bias-corrected confidence bounds and an offline PTAS oracle to obtain sublinear regret $\widetilde{\mathcal{O}}(T^{3/4})$ for general continuous distributions. For the special case of exponential outcomes under full-bandit feedback, an MLE-based algorithm (MLE-Exp) attains a tighter $\widetilde{\mathcal{O}}(\sqrt{T})$ regret by leveraging the minimum-of-exponentials property. Together, these results advance learning under partial, biased value-index feedback in bandit problems with continuous outcomes, with applications in recommendation, distributed computing, and server scheduling.
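
For reference, the minimum-of-exponentials property mentioned above is presumably the standard fact sketched below. This summary does not say how MLE-Exp uses it, so the symbols $\lambda_i$, $S$, $y_t$, and $n$ are notation assumed here for illustration only. If the outcomes $X_i \sim \mathrm{Exp}(\lambda_i)$ are independent, then for any $x \ge 0$

$$
\Pr\Big[\min_{i \in S} X_i > x\Big] \;=\; \prod_{i \in S} \Pr[X_i > x] \;=\; \prod_{i \in S} e^{-\lambda_i x} \;=\; \exp\Big(-x \sum_{i \in S} \lambda_i\Big),
$$

so $\min_{i \in S} X_i \sim \mathrm{Exp}\big(\sum_{i \in S} \lambda_i\big)$. Given $n$ independent observations $y_1, \dots, y_n$ of such a minimum, the textbook maximum likelihood estimate of the aggregate rate is

$$
\widehat{\Lambda}_S \;=\; \frac{n}{\sum_{t=1}^{n} y_t}.
$$

The point of the property is that a single scalar observation per round, with no arm index, still carries clean information about the sum of the rates of the selected arms.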

Abstract

We study the $K$-Max combinatorial multi-armed bandit problem with continuous outcome distributions and weak value-index feedback: each base arm has an unknown continuous outcome distribution, and in each round the learning agent selects $K$ arms, obtains the maximum value sampled from these $K$ arms as reward, and observes this reward together with the corresponding arm index as feedback. This setting captures critical applications in recommendation systems, distributed computing, and server scheduling. Continuous $K$-Max bandits introduce unique challenges, including discretization error from continuous-to-discrete conversion, non-deterministic tie-breaking under limited feedback, and biased estimation due to partial observability. Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds to tackle these challenges. For general continuous distributions, we prove that DCK-UCB achieves a $\widetilde{\mathcal{O}}(T^{3/4})$ regret upper bound, establishing the first sublinear regret guarantee for this setting. Furthermore, we identify an important special case with exponential distributions under full-bandit feedback. In this case, our proposed algorithm MLE-Exp achieves a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret upper bound through maximum likelihood estimation, attaining near-minimax optimality.
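
To make the feedback model and the estimation bias concrete, here is a minimal simulation sketch. It is not DCK-UCB and does not come from the paper: the uniform outcome distributions, the `scales` values, and the names `play_round` and `values_seen` are all assumptions chosen for illustration. It plays a fixed set of $K = 3$ arms, records only the winning value and winning index each round, and compares the naive per-arm average of observed values against the arm's true mean.

```python
import random


def play_round(rng, scales, chosen):
    """Play one round and return value-index feedback only.

    Assumption for illustration: arm a draws Uniform(0, scales[a]).
    The learner sees the maximum outcome and which arm produced it,
    never the outcomes of the losing arms.
    """
    outcomes = [rng.uniform(0.0, scales[a]) for a in chosen]
    best = max(range(len(chosen)), key=lambda j: outcomes[j])
    return outcomes[best], chosen[best]


rng = random.Random(0)
scales = [1.0, 0.8, 0.6, 0.4]  # hypothetical arms; true mean is scale / 2
chosen = [0, 1, 2]             # the K = 3 arms played every round
n_rounds = 20000

# Collect the observed (winning) values, grouped by the winning arm.
values_seen = {a: [] for a in chosen}
for _ in range(n_rounds):
    value, index = play_round(rng, scales, chosen)
    values_seen[index].append(value)

# Arm 1 is only observed when it wins, so its sample mean is inflated.
true_mean_arm1 = scales[1] / 2
naive_mean_arm1 = sum(values_seen[1]) / len(values_seen[1])
print(f"true mean of arm 1:  {true_mean_arm1:.3f}")
print(f"naive observed mean: {naive_mean_arm1:.3f}")
```

In this sketch the naive average for arm 1 lands well above its true mean, because an arm's outcome is recorded only in rounds where it beats the other selected arms. This selection effect is the kind of bias from partial observability that the abstract says the bias-corrected confidence bounds are designed to handle.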

Paper Structure

This paper contains 38 sections, 19 theorems, 150 equations, 2 algorithms.

Key Result

Lemma 4.2

For any $S \in {\mathcal{S}}$, we have

Theorems & Definitions (34)

  • Lemma 4.2
  • Lemma 4.3
  • Lemma 4.4: wang2023combinatorial
  • Lemma 4.5
  • Theorem 4.6
  • Lemma 4.7
  • Theorem 5.1
  • Lemma A.1
  • proof
  • Lemma A.2
  • ...and 24 more