Online Statistical Inference for Contextual Bandits via Stochastic Gradient Descent

Xiangyu Chang, Xi Chen, Zehua Lai, He Li, Zhihong Liu, Yichen Zhang

TL;DR

This work develops a general online statistical inference framework for contextual bandits via adaptive weighted SGD, enabling fully online updating and uncertainty quantification. It establishes asymptotic normality for the averaged SGD estimator with covariance $H^{-1} S H^{-1}$ under broad weighting and policy schemes, and provides a Bahadur representation that highlights slower convergence due to adaptive data collection. The paper also provides online plug-in methods to estimate the limiting covariance, analyzes optimal weighting in linear regression, and extends to non-smooth losses such as quantile regression. It offers practical guidance through two policies (modified ε-greedy and exponential) and validates the theory with simulations and a Yahoo! real-data study, showing reliable, narrow confidence intervals in online decision-making contexts.
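
To make the online loop concrete, here is a minimal sketch of averaged, inverse-probability-weighted SGD on a simulated linear contextual bandit. It is an illustration under stated assumptions, not the paper's implementation: a plain ε-greedy rule stands in for the modified ε-greedy policy, and the function name `run_weighted_sgd`, the simulated data, and all default values are hypothetical. Only the step size form $\eta_t = \eta_0 t^{-\alpha}$ with $\alpha\in(1/2,1)$ and the average $\bar{\theta}_t=t^{-1}\sum_{s=0}^{t-1}\theta_s$ are taken from the summary.

```python
import numpy as np


def run_weighted_sgd(T=20000, d=3, n_arms=2, eps=0.4, eta0=0.1, alpha=0.6, seed=0):
    """Averaged IPW-weighted SGD on a simulated linear contextual bandit.

    Assumptions for illustration only: plain epsilon-greedy (not the paper's
    modified variant), Gaussian contexts and noise, squared-error loss.
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=(n_arms, d))  # hypothetical true parameters, one row per arm
    theta = np.zeros((n_arms, d))              # current iterate theta_{t-1}
    theta_bar = np.zeros((n_arms, d))          # running average of the iterates

    for t in range(1, T + 1):
        # Update the average first, so theta_bar = mean(theta_0, ..., theta_{t-1}).
        theta_bar += (theta - theta_bar) / t

        # Observe a context and build epsilon-greedy action probabilities.
        x = rng.normal(size=d)
        greedy = int(np.argmax(theta @ x))
        probs = np.full(n_arms, eps / n_arms)
        probs[greedy] += 1.0 - eps

        # Act, observe a noisy reward from the chosen arm.
        a = int(rng.choice(n_arms, p=probs))
        y = theta_star[a] @ x + rng.normal()

        # Inverse-probability weight and squared-loss gradient for the chosen arm.
        w = 1.0 / probs[a]
        grad = (theta[a] @ x - y) * x

        # Weighted SGD step with eta_t = eta0 * t^(-alpha).
        theta[a] -= eta0 * t ** (-alpha) * w * grad

    return theta_bar, theta_star
```

Each round touches only the current observation, so memory stays constant in $t$; the averaged iterate is the quantity whose limiting distribution the summary describes.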

Abstract

With the fast development of big data, learning the optimal decision rule by recursively updating it and making online decisions has been easier than before. We study the online statistical inference of model parameters in a contextual bandit framework of sequential decision-making. We propose a general framework for an online and adaptive data collection environment that can update decision rules via weighted stochastic gradient descent. We allow different weighting schemes of the stochastic gradient and establish the asymptotic normality of the parameter estimator. Our proposed estimator significantly improves the asymptotic efficiency over the previous averaged SGD approach via inverse probability weights. We also conduct an optimality analysis on the weights in a linear regression setting. We provide a Bahadur representation of the proposed estimator and show that the remainder term in the Bahadur representation entails a slower convergence rate compared to classical SGD due to the adaptive data collection.
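
The weighting schemes mentioned above can be sketched as a small helper. The scheme names follow the figure captions (vanilla, IPW, sqrt-IPW); the exact formulas are assumptions here, with IPW taken as the reciprocal of the probability of the chosen action and sqrt-IPW as the reciprocal of its square root. The function names and the squared-error loss are hypothetical choices for illustration.

```python
import numpy as np


def gradient_weight(prob_chosen, scheme="ipw"):
    """Weight applied to the stochastic gradient (assumed definitions)."""
    if not 0.0 < prob_chosen <= 1.0:
        raise ValueError("prob_chosen must lie in (0, 1]")
    if scheme == "vanilla":
        return 1.0                           # unweighted SGD
    if scheme == "ipw":
        return 1.0 / prob_chosen             # inverse probability weight
    if scheme == "sqrt-ipw":
        return 1.0 / np.sqrt(prob_chosen)    # square-root inverse probability weight
    raise ValueError(f"unknown scheme: {scheme}")


def weighted_sgd_step(theta, x, y, prob_chosen, step_size, scheme="ipw"):
    """One weighted SGD step for squared-error loss on the chosen arm."""
    grad = (theta @ x - y) * x
    return theta - step_size * gradient_weight(prob_chosen, scheme) * grad
```

For example, an action chosen with probability 0.25 gets weight 4 under IPW, 2 under sqrt-IPW, and 1 under the vanilla scheme, so rarely chosen actions take larger steps under the first two.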

Paper Structure (44 sections, 17 theorems, 175 equations, 59 figures, 6 tables)

This paper contains 44 sections, 17 theorems, 175 equations, 59 figures, 6 tables.

Key Result

Theorem 3.1

Under Assumption assum:bound to Assumption assum:tv, the averaged SGD estimator $\bar{\theta}_t=t^{-1} \sum_{s=0}^{t-1} \theta_s$ converges to $\theta^*$ almost surely as $t \rightarrow \infty$, where $\theta_s$ is updated in eq:weighted-sgd with step size $\eta_t = \eta_0 t^{-\alpha}$, $\eta_0>0$, and $\alpha\in(1/2,1)$, $H=\nabla^2 \mathcal{L}_{\theta^*}(\theta^*)$, and $S = \mathbb{E}[\xi_{\ldots} \ldots]$ ...
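
Combined with the limiting covariance $H^{-1} S H^{-1}$ quoted in the TL;DR, the result can be read as the sketch below. The coordinate index $j$, the level $q$, and the use of the plug-in estimates $\widehat{H}_t$ and $\widehat{S}_t$ (the notation of Figure B.1) in the interval are assumptions made here for illustration, not the paper's exact statement:

$$
\sqrt{t}\,(\bar{\theta}_t - \theta^*) \xrightarrow{d} \mathcal{N}\!\left(0,\; H^{-1} S H^{-1}\right),
\qquad
\bar{\theta}_{t,j} \;\pm\; z_{1-q/2}\,\sqrt{\frac{\big(\widehat{H}_t^{-1}\,\widehat{S}_t\,\widehat{H}_t^{-1}\big)_{jj}}{t}}
$$

Here $z_{1-q/2}$ is the standard normal quantile, so $q = 0.05$ gives a nominal 95% interval of the kind whose coverage and length Figure 3 reports.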

Figures (59)

  • Figure 1: SGD on a non-degenerate linear regression model with the modified $\varepsilon$-greedy policy using different weight schemes. We report the empirical distribution of each action's first dimension of $\sqrt{t}(\bar{\theta}_t - \theta^*)$ based on 10,000 Monte-Carlo simulations. We plot the density of a zero-mean normal distribution that matches the second-order moments.
  • Figure 2: SGD on a degenerate linear regression model with the modified $\varepsilon$-greedy policy using different weight schemes. We report the empirical distribution of each action's first dimension of $\sqrt{t}(\bar{\theta}_t - \theta^*)$ based on 10,000 Monte-Carlo simulations. We plot the density of a zero-mean normal distribution that matches the second-order moments.
  • Figure 3: SGD on linear regression with sqrt-IPW in the near-degenerate model. We report the empirical coverage rate and its corresponding 95% CI length.
  • Figure 4: Empirical distribution transitions across different $T$ values for vanilla, Arm 0.
  • Figure B.1: SGD on linear regression with modified $\varepsilon$-greedy and different weights in the non-degenerate model. We report the empirical distribution of each action's first dimension of $\sqrt{t}\widehat{S}_t^{-1/2}\widehat{H}_t(\bar{\theta}_t - \theta^*)$ for 10,000 Monte-Carlo simulations.
  • ...and 54 more figures

Theorems & Definitions (40)

  • Example 2.1: Linear Regression
  • Example 2.2: Quantile Regression
  • Example 2.3: Logistic Regression
  • Theorem 3.1
  • Theorem 3.2
  • Remark 3.3
  • Proposition 3.4
  • Corollary 3.5
  • Corollary 4.1
  • Remark 4.2
  • ...and 30 more