Zero-Inflated Bandits

Haoyu Wei, Runzhe Wan, Lei Shi, Rui Song

TL;DR

This work introduces zero-inflated (ZI) bandits to address learning with sparse rewards. Rewards are modeled as $R_t = X_t Y_t$, where $X_t = \mu_{A_t} + \varepsilon_t$ and $Y_t \sim \text{Bernoulli}(p_{A_t})$, so arm $k$ has mean reward $r_k = \mu_k p_k$. It develops two complementary algorithmic frameworks, a product-based Upper Confidence Bound (UCB) and a Thompson Sampling (TS) approach, that exploit the separable $\mu$ and $p$ structure, and extends them to zero-inflated contextual bandits with GLM-like formulations. The paper provides finite-sample regret bounds for ZI multi-armed bandits (MAB) and ZI contextual bandits under sub-Weibull and sub-Gaussian tails, achieving minimax-optimal or near-optimal rates, and reports strong empirical gains on synthetic data and a real loan dataset. Together, these results bring zero-inflation modeling into both bandit theory and practice, enabling more efficient learning in domains with highly sparse rewards.
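To make the reward model and the product idea concrete, here is a minimal Python sketch. It is not the paper's algorithm: the function name `run_product_ucb`, the Gaussian noise, the UCB1-style confidence radii, and the assumption that all $\mu_k$ are positive are illustrative choices. The sketch keeps separate upper bounds for $\mu_k$ (from non-zero rounds only) and $p_k$ (from all rounds), then multiplies them to rank arms.

```python
import numpy as np


def run_product_ucb(mu, p, horizon, sigma=1.0, seed=0):
    """Illustrative product-form UCB for rewards R = X * Y (assumes mu_k > 0)."""
    mu, p = np.asarray(mu, dtype=float), np.asarray(p, dtype=float)
    rng = np.random.default_rng(seed)
    n_arms = len(mu)
    pulls = np.zeros(n_arms, dtype=int)    # number of times each arm was played
    nonzero = np.zeros(n_arms, dtype=int)  # number of rounds with Y = 1
    x_sum = np.zeros(n_arms)               # sum of X over rounds with Y = 1
    best_mean = np.max(mu * p)             # r* = max_k mu_k * p_k
    regret = 0.0

    for t in range(horizon):
        if t < n_arms:
            arm = t                        # play every arm once to initialize
        else:
            bonus = 2.0 * np.log(t + 1.0)  # UCB1-style radius (an assumption)
            p_ucb = np.minimum(1.0, nonzero / pulls + np.sqrt(bonus / pulls))
            safe = np.maximum(nonzero, 1)  # avoid division by zero
            # Arms with no non-zero observation yet get an infinite bound on mu.
            mu_ucb = np.where(
                nonzero > 0, x_sum / safe + sigma * np.sqrt(bonus / safe), np.inf
            )
            arm = int(np.argmax(mu_ucb * p_ucb))

        y = rng.random() < p[arm]           # Y_t ~ Bernoulli(p_arm)
        x = mu[arm] + sigma * rng.normal()  # X_t = mu_arm + noise
        pulls[arm] += 1
        if y:                               # X_t is only observed when R_t != 0
            nonzero[arm] += 1
            x_sum[arm] += x
        regret += best_mean - mu[arm] * p[arm]  # pseudo-regret of this round

    return pulls, regret


pulls, regret = run_product_ucb([0.5, 1.0, 3.0], [0.3, 0.3, 0.3], horizon=3000)
print(pulls, round(regret, 2))
```

Estimating $\mu_k$ only from non-zero rounds is what separates this from running a standard UCB directly on $R_t$, where the zeros inflate the apparent variance of the reward.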

Abstract

Many real-world bandit applications are characterized by sparse rewards, which can significantly hinder learning efficiency. Leveraging problem-specific structures for careful distribution modeling is recognized as essential for improving estimation efficiency in statistics. However, this approach remains under-explored in the context of bandits. To address this gap, we initiate the study of zero-inflated bandits, where the reward is modeled using a classic semi-parametric distribution known as the zero-inflated distribution. We develop algorithms based on the Upper Confidence Bound and Thompson Sampling frameworks for this specific structure. The superior empirical performance of these methods is demonstrated through extensive numerical studies.

Paper Structure

This paper contains 29 sections, 14 theorems, 258 equations, 14 figures, 2 tables, 6 algorithms.

Key Result

Lemma 2.1

Assuming independent $Y_t \sim \operatorname{Bernoulli}(p)$ and $X_t - \mu \sim \operatorname{subW}(\theta; C)$, let $R_t = X_t \times Y_t$. Then, there exists a constant $C_R > 0$ such that $R_t - \mu p \sim \operatorname{subW}(\theta; C_R)$.
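As a quick check on the centering term $\mu p$ in Lemma 2.1, the first two moments of $R_t$ can be worked out directly. This derivation assumes that $X_t$ is independent of $Y_t$ and writes $\sigma^2 = \operatorname{Var}(X_t)$; the symbol $\sigma^2$ is introduced here for illustration and does not appear in the lemma.

```latex
\begin{align*}
\mathbb{E}[R_t] &= \mathbb{E}[X_t]\,\mathbb{E}[Y_t] = \mu p, \\
\mathbb{E}[R_t^2] &= \mathbb{E}[X_t^2]\,\mathbb{E}[Y_t^2] = (\sigma^2 + \mu^2)\,p
  \quad \text{(since } Y_t^2 = Y_t\text{)}, \\
\operatorname{Var}(R_t) &= (\sigma^2 + \mu^2)\,p - \mu^2 p^2
  = p\,\sigma^2 + \mu^2\, p\,(1 - p).
\end{align*}
```

The second variance term, $\mu^2 p(1-p)$, comes purely from the Bernoulli component and is present even when $X_t$ is noiseless.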

Figures (14)

  • Figure 1: Results from a real personalized pricing dataset detailed in the experiments section. (a) Histogram of rewards, with zero represented in orange. (b) $1-\delta$ upper confidence bounds for various methods. We use Monte Carlo to approximate the true quantile (the tightest valid upper confidence bound). All methods are valid, as their curves lie above the Monte Carlo curve. Our method (green) achieves the tightest bound quickly. Notably, applying existing concentration inequalities directly to the reward (yellow), even with the true size parameter known but without using the ZI structure, results in a significantly looser bound.
  • Figure 2: Zero-inflated MAB with $K = 10$ and $T = 75000$ with $N = 50$ replications for $p \sim U[0.30, 0.35]$.
  • Figure 3: Zero-inflated contextual bandits with $T = 20000$ and $s = 7$ under $N = 25$ replications.
  • Figure 4: Results with the real dataset.
  • Figure 5: Simulation for zero-inflated MAB with $K = 10$ and $T = 75000$ with $N = 50$ replications. The four rows represent $p \sim U[0.10, 0.15]$, $p \sim U[0.15, 0.20]$, $p \sim U[0.20, 0.25]$, and $p \sim U[0.25, 0.30]$, respectively.
  • ...and 9 more figures
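The Monte Carlo reference curve described in the Figure 1 caption can be sketched as follows. This is a hypothetical illustration rather than the paper's procedure: it assumes a known zero-inflated Gaussian model, and the function name `mc_ucb_width` and its parameters are invented for the example. The idea is to simulate many samples of size $n$, compute the sample mean $\hat r_n$ of each, and take the $(1-\delta)$ quantile of $\mu p - \hat r_n$ as the smallest width that makes $\hat r_n + \text{width}$ a valid upper confidence bound.

```python
import numpy as np


def mc_ucb_width(mu, p, n, delta=0.05, sigma=1.0, n_sims=5000, seed=0):
    """Monte Carlo estimate of the tightest 1 - delta upper-confidence width.

    Assumes R = X * Y with X ~ Normal(mu, sigma^2) and Y ~ Bernoulli(p),
    which is an illustrative model rather than the paper's exact setup.
    """
    rng = np.random.default_rng(seed)
    y = rng.random((n_sims, n)) < p                # Bernoulli(p) indicators
    x = mu + sigma * rng.normal(size=(n_sims, n))  # non-zero part of the reward
    r_hat = (x * y).mean(axis=1)                   # one sample mean per simulation
    # Width w such that P(mu * p <= r_hat + w) is approximately 1 - delta.
    return float(np.quantile(mu * p - r_hat, 1.0 - delta))


for n in (50, 400):
    print(n, mc_ucb_width(mu=2.0, p=0.3, n=n))
```

Any analytic bound can then be compared against this width at each $n$: a valid bound should sit at or above it, and a tighter method sits closer to it.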

Theorems & Definitions (28)

  • Lemma 2.1
  • Lemma 2.2
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Lemma 6.1
  • Lemma A.1
  • Theorem A.2
  • Corollary B.4
  • Lemma D.1
  • ...and 18 more