A characterization of sample adaptivity in UCB data

Yilun Chen, Jiaqi Lu

TL;DR

This work investigates the statistical properties of data generated by UCB-type algorithms in a stochastic two-armed bandit, focusing on sample adaptivity—the correlation between arm pulls and empirical rewards. It develops a novel perturbation-based framework around a fluid UCB fixed point to prove a joint central limit theorem for pull counts and arm means, with a scaling that smoothly bridges large-gap (standard) and small-gap (slow-concentration) regimes. The main result yields a nonstandard CLT for the number of pulls and, consequently, a distributional characterization of pseudo-regret, while also revealing a leading-order sample bias arising from pull-mean coupling. A stylized data-generating model and accompanying numerics provide intuition and partial validation for the bias predictions, highlighting implications for downstream statistical inference and the design of exploration functions.

Abstract

We establish a joint CLT for the number of pulls and the sample mean rewards of the arms in a stochastic two-armed bandit environment under UCB algorithms. This result has two main implications: (1) a nonstandard CLT for the number of pulls, and hence for the pseudo-regret, that smoothly interpolates between a standard form in the large arm gap regime and a slow-concentration form in the small arm gap regime, and (2) a heuristic derivation of the sample bias, up to its leading order, from the correlation between the number of pulls and the sample means. Our analysis framework is based on a novel perturbation analysis, which is of broader interest in its own right.
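To see why a CLT for the number of pulls carries over to the pseudo-regret, the following identity may help. It is a sketch under assumptions not stated above: arm 1 is taken to be the better arm, $\Delta \triangleq \mu_1 - \mu_2 \ge 0$ denotes the arm gap, $n_{2,T}$ denotes the (random) number of pulls of arm 2 up to the horizon $T$, $\pi_t$ denotes the arm pulled at time $t$, and the pseudo-regret $R_T$ is taken to be the usual random quantity that sums the mean gaps of the pulled arms.

$$
R_T \;=\; \sum_{t=1}^{T} \bigl(\mu_1 - \mu_{\pi_t}\bigr) \;=\; \Delta \, n_{2,T}.
$$

Under this definition, $R_T$ is a deterministic multiple of $n_{2,T}$, so any distributional limit for a centered and scaled $n_{2,T}$ yields the corresponding limit for $R_T$ with the same centering and scaling multiplied by $\Delta$.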

Paper Structure

This paper contains 30 sections, 17 theorems, 101 equations, 4 figures, 1 algorithm.

Key Result

Lemma 3.1

Let $(n^{\star}_{1, T}, n^{\star}_{2, T})$ be the unique solution of Eq. \ref{eq: fluid equation 2 arm f}. Define $\lambda^{\star} \triangleq \lim_{T \to \infty}\frac{n^{\star}_{2, T}}{n^{\star}_{1, T}}$. The scaling of $(n^{\star}_{1, T}, n^{\star}_{2, T})$ and $\lambda^{\star}$ can be explicitly spec

Figures (4)

  • Figure 1: The empirical distribution of the sample mean of arm 2's reward under UCB1 ($f(t) = \sqrt{\rho\log T}$ with $\rho = 2$) for horizon length $T=10^5$, over $10^5$ repetitions. Arm $i$'s reward distribution is $\mathcal{N}(\mu_i, 1)$, $i = 1, 2$, with $\mu_1=\mu_2=0$. The sample mean $\bar{\mu}_{2,T}$ from each repetition is standardized as in \ref{eq:sample_mean_CLT_intro}, i.e., scaled by $\sqrt{n^\star_{2,T}}=\sqrt{T/2}$. The normal pdf curve matches the first two moments of the empirical distribution of the scaled sample means.
  • Figure 2: Scaled empirical bias of arm 2 over 10000 repetitions, $(\hat{\mu}_2(T)-\mu_2)\sqrt{T\log T}$, versus the conjectured bias of arm 2 in Conjecture \ref{conjecture:sampling-bias-UCB} scaled by $\sqrt{T\log T}$, $\sigma_2^2$, for different horizon lengths $T$. We fix $\mu_1=\mu_2=1$ and $\rho=2$, and vary the common value of $\sigma_1=\sigma_2$; each curve corresponds to one value.
  • Figure 3: Scaled empirical bias of arm 2 over 10000 repetitions, $(\hat{\mu}_2(T)-\mu_2)\log T$, versus the conjectured bias of arm 2 in Conjecture \ref{conjecture:sampling-bias-UCB} scaled by $\log T$, $\sigma_2^2(\mu_1-\mu_2)$, for different horizon lengths $T$. We fix $\mu_1=2$, $\sigma_1=\sigma_2=1$ and $\rho=2$, and vary the value of $\mu_2$; each curve corresponds to one value.
  • Figure 4: Scaled empirical bias of arm 2 over 10000 repetitions, $(\hat{\mu}_2(T)-\mu_2)\sqrt{T\log T}$, versus the conjectured bias of arm 2 in Conjecture \ref{conjecture:sampling-bias-UCB} scaled by $\sqrt{T\log T}$, for different horizon lengths $T$. We fix $\mu_2=0$, $\sigma_1=\sigma_2=1$ and $\rho=2$, and vary the value of $\theta$ in $\mu_1=\sqrt{\frac{\theta\log T}{T}}$; each curve corresponds to one value.
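The experiment behind Figure 1 can be approximated with a short simulation. The sketch below is not the authors' code: the function names `run_ucb` and `standardized_means` are hypothetical, the index form (sample mean plus $\sqrt{\rho \log T / n_i}$) is an assumed reading of $f(t) = \sqrt{\rho\log T}$, and initialising by pulling each arm once is an assumption as well. It uses a much smaller horizon and repetition count than the figure so that it runs quickly.

```python
import numpy as np


def run_ucb(T, mu=(0.0, 0.0), sigma=(1.0, 1.0), rho=2.0, rng=None):
    """Simulate one horizon-T run of a two-armed UCB rule with Gaussian rewards.

    Assumed index (not spelled out in the captions above):
        sample_mean_i + sqrt(rho * log(T) / n_i).
    Returns (pull counts, sample means), each a length-2 array.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(2, dtype=int)
    sums = np.zeros(2)
    bonus = rho * np.log(T)
    for t in range(T):
        if t < 2:
            arm = t  # assumption: pull each arm once to initialise
        else:
            index = sums / counts + np.sqrt(bonus / counts)
            arm = int(np.argmax(index))
        reward = rng.normal(mu[arm], sigma[arm])
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums / counts


def standardized_means(T, reps, seed=0):
    """Arm-2 sample means scaled by sqrt(T / 2), for mu_1 = mu_2 = 0."""
    rng = np.random.default_rng(seed)
    out = np.empty(reps)
    for r in range(reps):
        _, means = run_ucb(T, rng=rng)
        out[r] = np.sqrt(T / 2) * means[1]  # mu_2 = 0, so no centring needed
    return out


if __name__ == "__main__":
    z = standardized_means(T=2000, reps=200)
    # A negative average would be in line with the sample bias discussed above.
    print("mean:", z.mean(), "std:", z.std())
```

With larger `T` and `reps`, a histogram of `z` can be compared against a normal density matched to its first two moments, which is the comparison Figure 1 describes.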

Theorems & Definitions (24)

  • Conjecture 1.1: Informal
  • Lemma 3.1: Fluid Scaling
  • Theorem 3.1: Joint CLT
  • Corollary 4.1
  • Proposition 4.2
  • Conjecture 4.3
  • Lemma A.1
  • Corollary A.2
  • Lemma B.1
  • Lemma B.2: Lyapunov CLT for triangular arrays
  • ...and 14 more