Selective Reviews of Bandit Problems in AI via a Statistical View

Pengjie Zhou, Haoyu Wei, Huiming Zhang

TL;DR

The paper surveys the probabilistic and statistical foundations of stochastic bandit problems, covering multi-armed bandits (MAB), stochastic continuum-armed bandits (SCAB), and contextual variants, with a focus on non-asymptotic tools such as concentration inequalities and regret bounds. It contrasts frequentist and Bayesian exploration–exploitation strategies, namely Explore-Then-Commit (ETC), Upper Confidence Bound (UCB), MOSS, Thompson Sampling (TS), and MOTS, and connects SCAB to functional data analysis via smooth reward functions and Gaussian processes. A key contribution is a synthesis of non-asymptotic regret analyses and minimax lower bounds that clarifies fundamental limits and guides algorithm design for sequential decision-making under uncertainty. The work emphasizes how concentration-based methods and Bayesian optimization frameworks inform bandit policies across discrete, contextual, and continuum action spaces. Because the guarantees are non-asymptotic, the synthesis is directly relevant to online experiments, personalized decision-making, and adaptive design.

Abstract

Reinforcement Learning (RL) is a widely researched area in artificial intelligence that focuses on teaching agents decision-making through interactions with their environment. A key subset comprises the stochastic multi-armed bandit (MAB) and stochastic continuum-armed bandit (SCAB) problems, which model sequential decision-making under uncertainty. This review outlines the foundational models and assumptions of bandit problems, explores non-asymptotic theoretical tools such as concentration inequalities and minimax regret bounds, and compares frequentist and Bayesian algorithms for managing the exploration-exploitation trade-off. Additionally, we explore K-armed contextual bandits and SCAB, focusing on their methodologies and regret analyses. We also examine the connections between SCAB problems and functional data analysis. Finally, we highlight recent advances and ongoing challenges in the field.
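
To make the exploration-exploitation trade-off concrete, here is a minimal sketch of a UCB-style index policy on a Bernoulli K-armed bandit. It is an illustration only, not the paper's algorithm or notation: the function name `ucb1`, the arm means, the horizon, and the exploration bonus `sqrt(2 log t / n_i)` (the classical UCB1 choice) are all assumptions made for this example.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run a UCB1-style policy on a Bernoulli bandit (hypothetical helper).

    Returns the cumulative pseudo-regret and the number of pulls per arm.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of times each arm has been pulled
    sums = [0.0] * k   # total reward collected from each arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            # Initialization: pull every arm once so each index is defined.
            arm = t - 1
        else:
            # Index = empirical mean + confidence bonus (assumed UCB1 form).
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Pseudo-regret: gap between the best mean and the chosen arm's mean.
        regret += best - means[arm]
    return regret, counts


regret, counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
print(f"pseudo-regret: {regret:.1f}, pulls per arm: {counts}")
```

In this sketch the bonus term shrinks as an arm is pulled more often, so suboptimal arms are revisited only occasionally and most pulls should concentrate on the arm with the highest mean.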

Paper Structure

This paper contains 9 sections, 9 theorems, 43 equations, 1 figure.

Key Result

Lemma 1

Let $\varphi(x) : \mathbb{R} \to \mathbb{R}^{+}$ be a non-decreasing function. For r.v. $X$ with $E[\varphi(X)] < \infty$,

Figures (1)

  • Figure S1: A player plays at a three-armed bandit machine in a casino.

Theorems & Definitions (21)

  • Definition 1
  • Example 1: $K$-armed bandits, MAB
  • Example 2: Stochastic linear bandit, SLB
  • Example 3: Stochastic Contextual Bandits, SCB
  • Lemma 1: Markov's Inequality
  • Lemma 2: Chebyshev's inequality
  • Lemma 3: A refined Mill's Inequality
  • Example 4: $O(a^{-2})$-decay tail inequality is not enough
  • Lemma 4: Chernoff's inequality, or exponential Markov inequality
  • Lemma 5: Hoeffding's inequality, Theorem 2 in doi:10.1080/01621459.1963.10500830
  • ...and 11 more
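
As a rough numerical illustration of the concentration inequalities listed above, the sketch below compares the empirical upper-tail frequency of a Bernoulli sample mean with the one-sided Hoeffding bound $\exp(-2nt^2)$ for rewards in $[0, 1]$. The function name `hoeffding_check` and all parameter values are assumptions chosen for this example, and the bound is written in its standard textbook form, which may differ in notation from the paper's Lemma 5.

```python
import math
import random


def hoeffding_check(n=100, t=0.125, p=0.5, trials=5000, seed=0):
    """Compare an empirical tail frequency with the Hoeffding bound.

    Estimates P(sample mean - p >= t) for n Bernoulli(p) draws and returns
    it together with the bound exp(-2 * n * t**2) (hypothetical helper).
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        if successes / n - p >= t:
            exceed += 1
    empirical = exceed / trials
    bound = math.exp(-2.0 * n * t * t)
    return empirical, bound


empirical, bound = hoeffding_check()
print(f"empirical tail: {empirical:.4f}, Hoeffding bound: {bound:.4f}")
```

The empirical frequency should sit below the bound, typically by a wide margin, since Hoeffding's inequality holds for every distribution supported on $[0, 1]$ and is therefore not tight for any particular one.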