Towards Efficient and Optimal Covariance-Adaptive Algorithms for Combinatorial Semi-Bandits

Julien Zhou, Pierre Gaillard, Thibaud Rahier, Houssam Zenati, Julyan Arbel

TL;DR

This work addresses stochastic combinatorial semi-bandits, where a player selects among P actions from the power set of a set containing d base items. It designs "optimistic" covariance-adaptive algorithms, OLS-UCB-C and COS-V, that rely on online estimates of the covariance structure (COS-V uses only the variances).

Abstract

We address the problem of stochastic combinatorial semi-bandits, where a player selects among P actions from the power set of a set containing d base items. Adaptivity to the problem's structure is essential in order to obtain optimal regret upper bounds. As estimating the coefficients of a covariance matrix can be manageable in practice, leveraging them should improve the regret. We design "optimistic" covariance-adaptive algorithms relying on online estimations of the covariance structure, called OLS-UCB-C and COS-V (only the variances for the latter). Both yield improved gap-free regret. Although COS-V can be slightly suboptimal, it improves on computational complexity by taking inspiration from Thompson Sampling approaches. It is the first sampling-based algorithm satisfying a $T^{1/2}$ gap-free regret (up to poly-logs). We also show that in some cases, our approach efficiently leverages the semi-bandit feedback and outperforms bandit feedback approaches, not only in exponential regimes where $P \gg d$ but also when $P \leq d$, which is not covered by existing analyses.
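To make the setting concrete, here is a minimal Python sketch of semi-bandit feedback with online covariance estimation. It is not OLS-UCB-C or COS-V: actions are drawn uniformly at random, rewards are Gaussian with a shared noise term, and the names `simulate` and `empirical_cov` as well as all numeric values are assumptions made for illustration only.

```python
import random


def simulate(actions, mean, shared_sd, own_sd, rounds, seed=0):
    """Play uniformly random actions and accumulate pairwise statistics.

    Assumption: each item's reward is mean[i] + shared noise + own noise,
    so distinct items have covariance shared_sd ** 2.
    """
    rng = random.Random(seed)
    d = len(mean)
    n = [[0] * d for _ in range(d)]        # n[i][j]: rounds where i and j were both played
    sx = [[0.0] * d for _ in range(d)]     # sx[i][j]: sum of X_i over those rounds
    sxy = [[0.0] * d for _ in range(d)]    # sxy[i][j]: sum of X_i * X_j over those rounds
    for _ in range(rounds):
        action = rng.choice(actions)
        shared = rng.gauss(0.0, shared_sd)
        # Semi-bandit feedback: the reward of every selected item is observed,
        # not only the total reward of the action.
        obs = {i: mean[i] + shared + rng.gauss(0.0, own_sd) for i in action}
        for i in action:
            for j in action:
                n[i][j] += 1
                sx[i][j] += obs[i]
                sxy[i][j] += obs[i] * obs[j]
    return n, sx, sxy


def empirical_cov(n, sx, sxy, i, j):
    """Plug-in covariance estimate for items i and j, or None if never co-played."""
    m = n[i][j]
    if m == 0:
        return None
    return sxy[i][j] / m - (sx[i][j] / m) * (sx[j][i] / m)


# Hypothetical instance: d = 3 base items, P = 3 actions.
actions = [(0, 1), (1, 2), (0,)]
n, sx, sxy = simulate(actions, mean=[0.2, 0.5, 0.8],
                      shared_sd=1.0, own_sd=0.5, rounds=6000)
cov_01 = empirical_cov(n, sx, sxy, 0, 1)   # estimable: items 0 and 1 are played together
cov_02 = empirical_cov(n, sx, sxy, 0, 2)   # None: no action contains both 0 and 2
```

In this sketch, a covariance coefficient can only be estimated for pairs of items that appear together in some action, which is why `cov_02` stays undefined.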

Paper Structure (63 sections, 32 theorems, 179 equations, 6 figures, 1 table, 3 algorithms)

Key Result

Theorem 1

Let $T\in\mathbb{N}^*$ and $\delta>0$. Then OLS-UCB-C satisfies a gap-dependent regret upper bound, stated in terms of $\sigma_{a,i}^2 = \sum_{j \in a} \max\{\mathbf{\Sigma}_{i,j},0\}$, and a gap-free regret upper bound.
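As a small illustration of the quantity $\sigma_{a,i}^2$ in the theorem statement, the following Python sketch evaluates it for a made-up covariance matrix. The function name `sigma_sq` and the matrix entries are hypothetical and not taken from the paper.

```python
def sigma_sq(cov, action, i):
    """Return the sum over j in `action` of max(cov[i][j], 0)."""
    return sum(max(cov[i][j], 0.0) for j in action)


# Hypothetical 3-item covariance matrix (illustrative values only).
cov = [
    [1.0, 0.5, -0.3],
    [0.5, 2.0, 0.0],
    [-0.3, 0.0, 1.5],
]
action = [0, 1, 2]  # an action that selects all three base items

# One value per item of the action.
per_item = {i: sigma_sq(cov, action, i) for i in action}
```

Because negative covariances are clipped at zero, the entry `-0.3` between items 0 and 2 contributes nothing to either item's value.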

Figures (6)

  • Figure 1: Evolution of regret upper bounds.
  • Figure 2: Pseudo-regret for ESCB-C and OLS-UCB-C for randomly sampled environments (with q25 and q75 confidence intervals).
  • Figure 3: Pseudo-Regret with respect to $1/\Delta_{\min }$.
  • Figure 4: Pseudo-Regret in the "worst" environment.
  • Figure : OLS-UCB-C
  • ...and 1 more figure

Theorems & Definitions (47)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Theorem 3
  • proof
  • Proposition 4
  • ...and 37 more