Asymptotic Theory and Sequential Test for General Multi-Armed Bandit Process

Li Yang, Xiaodong Yan, Dandan Jiang

TL;DR

The Urn Bandit (UNB) process integrates the reinforcement mechanism of urn probabilistic models with multi-armed bandit (MAB) principles, ensuring almost sure convergence of resource allocation to the optimal arms. The paper also establishes a joint functional central limit theorem (FCLT) for consistent estimators of expected rewards under non-i.i.d. sequences and sublinear information.

Abstract

Multi-armed bandit (MAB) processes constitute a foundational subclass of reinforcement learning problems and represent a central topic in statistical decision theory, but their use for simultaneous adaptive allocation and sequential testing is limited by the absence of asymptotic theory under non-i.i.d. sequences and sublinear information. To address this open challenge, we propose the Urn Bandit (UNB) process, which integrates the reinforcement mechanism of urn probabilistic models with MAB principles and ensures almost sure convergence of resource allocation to the optimal arms. We establish a joint functional central limit theorem (FCLT) for consistent estimators of expected rewards under non-i.i.d., non-sub-Gaussian and sublinear reward samples with pairwise correlations across arms. To overcome the limitations of existing methods, which focus mainly on cumulative regret, we establish the asymptotic theory together with adaptive allocation that supports powerful sequential tests, such as arm comparison, A/B testing, and policy valuation. Simulation studies and real data analysis demonstrate that UNB maintains the statistical test performance of the equal randomization (ER) design while obtaining higher average rewards, as classical MAB processes do.
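To give a feel for what "the reinforcement mechanism of urn probabilistic models" means, the sketch below simulates a classical randomized play-the-winner urn for two Bernoulli arms. This is not the UNB process itself, whose update rule is not stated in the abstract; the function name `rpw_urn`, the success probabilities, and the initial urn composition are all illustrative assumptions. Unlike the almost sure convergence to the optimal arm claimed for UNB, this classical urn only skews allocation toward the better arm.

```python
import random


def rpw_urn(success_probs, n_steps, seed=0):
    """Randomized play-the-winner urn for two Bernoulli arms.

    Illustrative only: a classical urn design, NOT the paper's UNB rule.
    success_probs, n_steps and seed are hypothetical parameters.
    """
    rng = random.Random(seed)
    balls = [1, 1]   # one ball per arm to start (assumed initial composition)
    pulls = [0, 0]   # number of times each arm was played
    for _ in range(n_steps):
        # Draw a ball (with replacement); its type selects the arm to play.
        arm = 0 if rng.random() < balls[0] / (balls[0] + balls[1]) else 1
        pulls[arm] += 1
        success = rng.random() < success_probs[arm]
        # Reinforcement: a success adds a ball for the played arm,
        # a failure adds a ball for the other arm.
        balls[arm if success else 1 - arm] += 1
    return pulls, balls


pulls, balls = rpw_urn([0.8, 0.4], n_steps=5000)
frac_best = pulls[0] / sum(pulls)
print(f"share of pulls on the better arm: {frac_best:.3f}")
```

The urn composition is the only state the allocation rule needs, which is what makes urn models convenient for adaptive designs: each observed reward shifts future sampling probabilities without any explicit index computation.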

Paper Structure

This paper contains 18 sections, 13 theorems, 46 equations, 6 figures, 5 tables, 2 algorithms.

Key Result

Theorem 3.1

Under Assumptions meanbound--meanconver, the estimators of the expected rewards are asymptotically normally distributed, and satisfy that and the estimated covariance matrix $\widehat{\bm{\Sigma}}_n$ is defined element-wise by:

Figures (6)

  • Figure 1: Simulated Type I error rate (size) of the UNB test and the naive classical test under the null hypothesis $H_0: \mu_1=\mu_2$ with a fixed batch size of $N_t=4$, across different levels of cross-arm dependence. The UNB test maintains the Type I error rate.
  • Figure 2: Asymptotic power curves of different allocation strategies for the two-arm test \ref{eqtwoarm}, plotted against the difference $\Delta$ in the arms' expected rewards at $S=2000$ and against the sample size $S$ at $\Delta=0.5$. UNB closely approximates the ER benchmark, which maximizes power, in both cases.
  • Figure 3: Dual-axis plots of ASN (left axis, solid lines) and $S_{\text{inf}}$ (right axis, dashed lines) versus $\Delta$ for different reward distributions under the information-based sequential design. UNB balances ethics and statistical efficiency: its ASN is similar to that of ER, but its $S_{\text{inf}}$ is smaller.
  • Figure 4: Loss index \ref{eq:loss_function}, $L_{\lambda}=\mathrm{ASN}+\lambda S_{\text{inf}}$, as a function of $\Delta$ for different reward distributions under the information-based sequential design with $\lambda=2$ (top) and $\lambda=5$ (bottom). UNB performs better than ER and UCB as $\Delta$ increases.
  • Figure 5: Empirical probability density of $p$-values under $H_0$, based on 2000 Monte Carlo samples on the semi-synthetic real dataset. The red dashed line denotes the uniform distribution $U[0,1]$. Like the ER benchmark, UNB shows a valid Type I error rate.
  • ...and 1 more figure
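
The loss index in the Figure 4 caption, $L_{\lambda}=\mathrm{ASN}+\lambda S_{\text{inf}}$, is a simple weighted sum and can be sketched as below. The helper name `loss_index` and the numeric inputs are made up for illustration and are not results from the paper. $S_{\text{inf}}$ is assumed here to count samples allocated to the inferior arm, since the caption does not define it.

```python
def loss_index(asn: float, s_inf: float, lam: float) -> float:
    """L_lambda = ASN + lambda * S_inf, as written in the Figure 4 caption.

    asn   : average sample number of the sequential design
    s_inf : assumed to be the number of samples on the inferior arm
    lam   : penalty weight lambda (the caption uses 2 and 5)
    """
    return asn + lam * s_inf


# Hypothetical inputs, chosen only to show how lambda shifts the trade-off.
design_a = {"asn": 100.0, "s_inf": 50.0}   # e.g. an evenly split design
design_b = {"asn": 110.0, "s_inf": 30.0}   # slightly longer, fewer inferior pulls

for lam in (2, 5):
    la = loss_index(design_a["asn"], design_a["s_inf"], lam)
    lb = loss_index(design_b["asn"], design_b["s_inf"], lam)
    print(f"lambda={lam}: design_a={la}, design_b={lb}")
```

A larger $\lambda$ penalizes inferior-arm samples more heavily, so a design that needs slightly more samples overall can still achieve a lower loss if it sends fewer of them to the inferior arm.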

Theorems & Definitions (13)

  • Theorem 3.1
  • Theorem 3.2
  • Lemma 3.3
  • Theorem 3.4
  • Corollary 3.5
  • Theorem 4.1: FCLT with Stable Convergence
  • Lemma 4.2
  • Theorem 4.3: FCLT under information fraction
  • Corollary 4.4
  • Corollary 4.5
  • ...and 3 more