From Contextual Combinatorial Semi-Bandits to Bandit List Classification: Improved Sample Complexity with Sparse Rewards

Liad Erez, Tomer Koren

TL;DR

The paper studies contextual combinatorial semi-bandits (CCSB) in the $s$-sparse reward regime, where the sum of rewards is bounded by some $s\ll K$. It develops PAC and regret guarantees that scale with the sparsity $s$ rather than with the number of actions $K$. The PAC algorithm has two phases: it first finds a low-variance exploration distribution via Frank-Wolfe, then estimates policy rewards with importance sampling. This yields a PAC sample complexity of $\tilde{O}((\mathrm{poly}(K/m) + sm/\varepsilon^2)\log(|\Pi|/\delta))$. For the regret minimization setting, the paper establishes a regret bound of $\tilde{O}(|\Pi| + \sqrt{smT\log|\Pi|})$. The approach extends to bandit multiclass list classification and, in the single-label case, yields an improved bound of $O((K^7 + 1/\varepsilon^2)\log(|\mathcal{H}|/\delta))$. A matching lower bound shows that the $sm/\varepsilon^2$ term is necessary. The paper also discusses open questions on full-bandit feedback, tighter dependence on $K$, and extensions to infinite classes, with implications for scalable, sparsity-adaptive bandit learning in recommendation and similar settings.
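
The importance-sampling step mentioned above can be illustrated with a minimal sketch. It is not the paper's algorithm: instead of the Frank-Wolfe exploration distribution it assumes subsets of size $m$ are drawn uniformly at random, so every action has inclusion probability $m/K$. The function names `ips_reward_estimate` and `policy_value` are hypothetical.

```python
from itertools import combinations

import numpy as np


def ips_reward_estimate(reward, chosen, incl_prob):
    """Importance-weighted estimate of the full reward vector.

    reward:    length-K array; only the entries in `chosen` are observed.
    chosen:    indices of the m actions played this round (semi-bandit feedback).
    incl_prob: length-K array, incl_prob[a] = P(action a is in the played subset).
    """
    est = np.zeros_like(reward, dtype=float)
    idx = list(chosen)
    # Observed rewards are divided by their inclusion probability;
    # unobserved actions keep an estimate of zero.
    est[idx] = reward[idx] / incl_prob[idx]
    return est


def policy_value(policy_subset, reward_estimate):
    """Estimated reward of a policy that would play `policy_subset`."""
    return float(reward_estimate[list(policy_subset)].sum())


# Toy check of unbiasedness (assumed uniform exploration, not Frank-Wolfe).
K, m = 4, 2
reward = np.array([1.0, 0.0, 0.0, 0.0])  # sparse reward vector with s = 1
incl_prob = np.full(K, m / K)
subsets = list(combinations(range(K), m))

# Averaging the estimate over all equally likely subsets recovers the reward.
avg_estimate = np.mean(
    [ips_reward_estimate(reward, A, incl_prob) for A in subsets], axis=0
)
```

Averaging over all equally likely subsets recovers the true reward vector, which is the unbiasedness property that importance sampling relies on. The variance of such an estimator depends on the exploration distribution, which is what the first phase of the paper's algorithm is designed to control.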

Abstract

We study the problem of contextual combinatorial semi-bandits, where input contexts are mapped into subsets of size $m$ of a collection of $K$ possible actions. In each round, the learner observes the realized reward of the predicted actions. Motivated by prototypical applications of contextual bandits, we focus on the $s$-sparse regime where we assume that the sum of rewards is bounded by some value $s\ll K$. For example, in recommendation systems the number of products purchased by any customer is significantly smaller than the total number of available products. Our main result is for the $(\varepsilon,\delta)$-PAC variant of the problem for which we design an algorithm that returns an $\varepsilon$-optimal policy with high probability using a sample complexity of $\tilde{O}((\mathrm{poly}(K/m)+sm/\varepsilon^2)\log(|\Pi|/\delta))$ where $\Pi$ is the underlying (finite) class and $s$ is the sparsity parameter. This bound improves upon known bounds for combinatorial semi-bandits whenever $s\ll K$, and in the regime where $s=O(1)$, the leading terms in our bound match the corresponding full-information rates, implying that bandit feedback essentially comes at no cost. Our algorithm is also computationally efficient given access to an ERM oracle for $\Pi$. Our framework generalizes the list multiclass classification problem with bandit feedback, which can be seen as a special case with binary reward vectors. In the special case of single-label classification corresponding to $s=m=1$, we prove an $O((K^7+1/\varepsilon^2)\log(|\mathcal{H}|/\delta))$ sample complexity bound, which improves upon recent results in this scenario. Additionally, we consider the regret minimization setting where data can be generated adversarially, and establish a regret bound of $\tilde{O}(|\Pi|+\sqrt{smT\log|\Pi|})$, extending the result of Erez et al. (2024) who consider the simpler single-label classification setting.
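
To make the list classification special case concrete, here is a small sketch of one round of semi-bandit feedback with binary rewards. It assumes a reading in which each example has a set of at most $s$ correct labels, the learner predicts a list of $m$ labels, and it learns for each predicted label whether it is correct. The function names and the toy numbers are hypothetical.

```python
def semi_bandit_feedback(predicted, true_labels):
    """Binary feedback for each predicted label only (1 = correct, 0 = not).

    Labels outside `predicted` are never revealed to the learner.
    """
    return {a: int(a in true_labels) for a in predicted}


def list_reward(predicted, true_labels):
    """Total reward of the predicted list: number of correct labels in it."""
    return sum(semi_bandit_feedback(predicted, true_labels).values())


# Toy round: K = 10 labels, list size m = 3, and s = 2 correct labels.
true_labels = {2, 7}
predicted = (1, 2, 3)
feedback = semi_bandit_feedback(predicted, true_labels)
reward = list_reward(predicted, true_labels)
```

In this toy round the learner sees that label 2 is correct and that labels 1 and 3 are not, but learns nothing about label 7. The total reward of any list can never exceed $s$, which is the sparsity the bounds exploit.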

Paper Structure

This paper contains 29 sections, 15 theorems, 97 equations, 4 algorithms.

Key Result

Theorem 1

If we set $\gamma = \frac{1}{2}$, $N_1 = \widetilde{\Theta}\left(\frac{K^9}{m^8} \log(|\Pi|/\delta)\right)$, $N_2 = \Theta\left(\left(K/m\varepsilon + sm/\varepsilon^2\right)\log(|\Pi|/\delta)\right)$, and $T=\Theta\left((K/m)^5\right)$, then with probability at least $1-\delta$ the algorithm (alg:pac-comband) outputs $\pi_{\dots}$ […]. Furthermore, the algorithm makes a total of $T+1 = O\left((K/m)^5\right)$ calls to $\textsf{ERM}_{\dots}$ […]

Theorems & Definitions (29)

  • Theorem 1
  • proof : Proof of thm:pac-main (sketch).
  • Lemma 1
  • Lemma 2
  • Theorem 2
  • proof : Proof of thm:lower-bound (sketch).
  • Lemma 3
  • proof
  • proof : Proof of lem:log-self-concordance
  • proof : Proof of thm:pac-main
  • ...and 19 more