Zero-Inflated Bandits
Haoyu Wei, Runzhe Wan, Lei Shi, Rui Song
TL;DR
This work introduces zero-inflated (ZI) bandits to address learning with sparse rewards. Rewards are modeled as $R_t = X_t Y_t$, where $X_t = \mu_{A_t} + \varepsilon_t$ and $Y_t \sim \text{Bernoulli}(p_{A_t})$, yielding arm means $r_k = \mu_k p_k$. It develops two complementary algorithmic frameworks, a product-based Upper Confidence Bound (UCB) approach and a Thompson Sampling (TS) approach, that exploit the separable $\mu$ and $p$ structure, and it extends the framework to zero-inflated contextual bandits with GLM-like formulations. The paper provides finite-sample regret bounds for ZI multi-armed bandits (MAB) and ZI contextual bandits under sub-Weibull and sub-Gaussian tails, achieving minimax-optimal or near-optimal rates, and demonstrates strong empirical gains on synthetic data and a real loan dataset. These contributions bring zero-inflated modeling into both bandit theory and practice, enabling more efficient learning in domains with highly sparse rewards.
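To make the reward model $R_t = X_t Y_t$ and the idea of a product-based index concrete, the following is a minimal simulation sketch, not the paper's algorithm. It assumes Gaussian noise for $\varepsilon_t$, strictly positive $\mu_k$ (so that multiplying two upper bounds gives an upper bound on $r_k = \mu_k p_k$), and generic Hoeffding-style bonuses; the paper's actual confidence widths and constants may differ. The function name `product_ucb` and its parameters are hypothetical.

```python
import numpy as np


def product_ucb(mu, p, horizon, sigma=1.0, seed=0):
    """Run a product-style UCB on a simulated zero-inflated bandit.

    Hypothetical helper: assumes Gaussian noise, mu_k > 0, and generic
    Hoeffding-style bonuses (not the paper's exact confidence widths).
    Returns (pull counts per arm, cumulative reward).
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    p = np.asarray(p, dtype=float)
    n_arms = len(mu)

    pulls = np.zeros(n_arms)        # n_k: times arm k was played
    nonzero = np.zeros(n_arms)      # m_k: plays of arm k with Y = 1
    nonzero_sum = np.zeros(n_arms)  # sum of X over plays with Y = 1
    total = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once to initialise
        else:
            log_term = 2.0 * np.log(t)
            # Upper bound on p_k from all pulls, clipped to 1.
            ucb_p = np.minimum(1.0, nonzero / pulls + np.sqrt(log_term / pulls))
            # Upper bound on mu_k from non-zero rewards only;
            # infinite until at least one non-zero reward is seen.
            seen = nonzero > 0
            safe_m = np.maximum(nonzero, 1.0)
            ucb_mu = np.where(
                seen,
                nonzero_sum / safe_m + sigma * np.sqrt(log_term / safe_m),
                np.inf,
            )
            arm = int(np.argmax(ucb_mu * ucb_p))

        # Zero-inflated reward R = X * Y. With continuous X, observing
        # R != 0 reveals Y almost surely, so using y directly is harmless.
        y = rng.random() < p[arm]
        x = mu[arm] + sigma * rng.normal()
        reward = x if y else 0.0

        pulls[arm] += 1
        if y:
            nonzero[arm] += 1
            nonzero_sum[arm] += x
        total += reward

    return pulls, total


# Arm means r_k = mu_k * p_k are 0.1, 0.5, and 1.5, so arm 2 is best
# even though arm 1 pays out more often.
pulls, total = product_ucb(mu=[1.0, 1.0, 5.0], p=[0.1, 0.5, 0.3], horizon=5000)
```

The sketch estimates $p_k$ from every pull of arm $k$ but estimates $\mu_k$ only from the non-zero rewards, which is one way to use the separable structure instead of treating $R_t$ as a single noisy observation.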
Abstract
Many real-world bandit applications are characterized by sparse rewards, which can significantly hinder learning efficiency. Leveraging problem-specific structures for careful distribution modeling is recognized as essential for improving estimation efficiency in statistics. However, this approach remains under-explored in the context of bandits. To address this gap, we initiate the study of zero-inflated bandits, where the reward is modeled using a classic semi-parametric distribution known as the zero-inflated distribution. We develop algorithms based on the Upper Confidence Bound and Thompson Sampling frameworks for this specific structure. The superior empirical performance of these methods is demonstrated through extensive numerical studies.
