Zero-Inflated Bandits

Haoyu Wei; Runzhe Wan; Lei Shi; Rui Song

TL;DR

This work introduces zero-inflated (ZI) bandits to address learning with sparse rewards. Rewards are modeled as $R_t = X_t Y_t$, where $X_t = \mu_{A_t} + \varepsilon_t$ and $Y_t \sim \text{Bernoulli}(p_{A_t})$, so that arm $k$ has mean reward $r_k = \mu_k p_k$. The paper develops two complementary algorithmic frameworks, a product-based Upper Confidence Bound (UCB) approach and a Thompson Sampling (TS) approach, that exploit the separable $\mu$ and $p$ structure, and extends them to zero-inflated contextual bandits with GLM-like formulations. It provides finite-sample regret bounds for ZI multi-armed bandits (MAB) and ZI contextual bandits under sub-Weibull and sub-Gaussian tails, achieving minimax-optimal or near-optimal rates, and demonstrates strong empirical gains on synthetic data and a real loan dataset. Together, these results bring zero-inflation modeling into bandit theory and practice, enabling more efficient learning in domains with highly sparse rewards.
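The reward model above can be illustrated with a small simulation. This is a minimal sketch rather than the paper's algorithm: the Gaussian noise for $\varepsilon_t$, the function names `simulate_zi_rewards` and `split_estimates`, and the parameter values are all assumptions made for illustration. It also assumes $X_t \neq 0$ almost surely, so that a nonzero reward reveals $Y_t = 1$.

```python
import random


def simulate_zi_rewards(mu, p, n, sigma=1.0, seed=0):
    """Draw n rewards R = X * Y for a single arm.

    Assumptions (not from the source): X ~ Normal(mu, sigma^2) and
    X is independent of Y ~ Bernoulli(p).
    """
    rng = random.Random(seed)
    rewards = []
    for _ in range(n):
        y = 1 if rng.random() < p else 0   # zero-inflation indicator
        x = rng.gauss(mu, sigma)           # nonzero part of the reward
        rewards.append(x * y)
    return rewards


def split_estimates(rewards):
    """Estimate mu and p separately from observed rewards only.

    A reward is treated as a draw of X whenever it is nonzero.
    """
    nonzero = [r for r in rewards if r != 0.0]
    p_hat = len(nonzero) / len(rewards)
    mu_hat = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return mu_hat, p_hat


rewards = simulate_zi_rewards(mu=2.0, p=0.3, n=20000)
mu_hat, p_hat = split_estimates(rewards)
sample_mean = sum(rewards) / len(rewards)
print(mu_hat, p_hat, mu_hat * p_hat, sample_mean)
```

The product of the two plug-in estimates equals the plain sample mean of the rewards algebraically, so any benefit from the separable structure must come from how uncertainty is quantified for each factor, not from the point estimate itself.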

Abstract

Many real-world bandit applications are characterized by sparse rewards, which can significantly hinder learning efficiency. Leveraging problem-specific structures for careful distribution modeling is recognized as essential for improving estimation efficiency in statistics. However, this approach remains under-explored in the context of bandits. To address this gap, we initiate the study of zero-inflated bandits, where the reward is modeled using a classic semi-parametric distribution known as the zero-inflated distribution. We develop algorithms based on the Upper Confidence Bound and Thompson Sampling frameworks for this specific structure. The superior empirical performance of these methods is demonstrated through extensive numerical studies.

Paper Structure (29 sections, 14 theorems, 258 equations, 14 figures, 2 tables, 6 algorithms)

Introduction
Zero-Inflated Multi-Armed Bandits
Proposed product method and upper confidence bound approach
Thompson sampling approach
Zero-Inflated Contextual Bandits
Theory
Regret bounds for ZI MAB
Regret bounds for ZI contextual bandits
Experiment
Discussion
Heavy-tailed MAB
Additional Algorithms Details
Algorithms for MAB
Algorithms for Generalized Linear Contextual Bandits
UCB Algorithms and Regularity Conditions
...and 14 more sections

Key Result

Lemma 2.1

Assuming independent $Y_t \sim \operatorname{Bernoulli}(p)$ and $X_t - \mu \sim \operatorname{subW}(\theta; C)$, let $R_t = X_t \times Y_t$. Then, there exists a constant $C_R > 0$ such that $R_t - \mu p \sim \operatorname{subW}(\theta; C_R)$.
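The centering term $\mu p$ in the lemma is the mean of $R_t$. As a short worked check, assuming that $X_t$ and $Y_t$ are independent of each other and writing $\sigma^2 = \operatorname{Var}(X_t)$ (notation introduced here, not taken from the source), the first two moments of the product are:

$$
\begin{aligned}
\mathbb{E}[R_t] &= \mathbb{E}[X_t]\,\mathbb{E}[Y_t] = \mu p, \\
\operatorname{Var}(R_t) &= \mathbb{E}[X_t^2]\,\mathbb{E}[Y_t^2] - (\mu p)^2 = (\sigma^2 + \mu^2)\,p - \mu^2 p^2 = p\,\sigma^2 + p(1-p)\,\mu^2 .
\end{aligned}
$$

The second variance term, $p(1-p)\mu^2$, comes entirely from the Bernoulli factor, which suggests why treating $R_t$ as a single noisy quantity can be wasteful when $\mu$ is large relative to $\sigma$.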

Figures (14)

Figure 1: Results from a real personalized pricing dataset detailed in the Experiment section. (a) Histogram of rewards, with zero represented in orange. (b) $1-\delta$ upper confidence bounds for various methods. We use Monte Carlo to approximate the true quantile (the tightest valid upper confidence bound). All methods are valid, since their curves lie above the Monte Carlo curve. Our method (green) achieves the tightest bound quickly. Notably, applying existing concentration inequalities directly to the reward (yellow), even with knowledge of the true size parameter but without utilizing the ZI structure, results in a significantly looser bound.
Figure 2: Zero-inflated MAB with $K = 10$ and $T = 75000$ with $N = 50$ replications for $p \sim U[0.30, 0.35]$.
Figure 3: Zero-inflated contextual bandits with $T = 20000$ and $s = 7$ under $N = 25$ replications.
Figure 4: Results with the real dataset.
Figure 5: Simulation for zero-inflated MAB with $K = 10$ and $T = 75000$ with $N = 50$ replications. The four rows represent $p \sim U[0.10, 0.15]$, $p \sim U[0.15, 0.20]$, $p \sim U[0.20, 0.25]$, and $p \sim U[0.25, 0.30]$, respectively.
...and 9 more figures

Theorems & Definitions (28)

Lemma 2.1
Lemma 2.2
Theorem 4.1
Theorem 4.2
Theorem 4.3
Lemma 6.1
Lemma A.1
Theorem A.2
Corollary B.4
Lemma D.1
...and 18 more
