Regret Minimization and Statistical Inference in Online Decision Making with High-dimensional Covariates

Congyuan Duan, Wanteng Ma, Jiashuo Jiang, Dong Xia

TL;DR

This work addresses simultaneous regret minimization and statistical inference in online decision making with high-dimensional covariates by combining an online hard-thresholding (HT) estimator with an $\varepsilon$-greedy policy and debiasing via inverse propensity weighting. It reveals a fundamental trade-off between exploration-induced regret and inference efficiency, governed by the exploration decay rate $\gamma$ and the margin parameter $\nu$, with regret scaling as $R_T\lesssim R_{\max}T^{1-\gamma}+\dots$ and inference variance scaling as $O(T^{-(1-\gamma)})$. Under a covariate-diversity condition, the authors propose an exploration-free scheme using average weighting (AW) that achieves both $O(\log T)$ regret and $\sqrt{T}$-consistent inference, and they extend the framework to inference on the optimal policy's value $V^*$, estimated via $\hat V_T = T^{-1}\sum_{t} y_t$. Numerical experiments on simulations and Warfarin dosing data validate accurate, calibrated inference for parameters and the optimal value, and illustrate practical gains in high-dimensional personalized decision making. Overall, the paper provides a principled framework for balancing learning and uncertainty quantification in online high-dimensional settings, with concrete guidance for kernelized online debiasing and covariate-diversity-driven exploration strategies.
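The role of the decay rate $\gamma$ can be illustrated with a toy simulation. The sketch below is not the paper's algorithm: it assumes a two-armed bandit without covariates, Gaussian noise, running-mean estimates in place of the HT estimator, and a schedule clipped to $[0,1]$; the names `exploration_prob` and `simulate` are hypothetical. It counts the exploration rounds and returns the sample-mean value estimate $\hat V_T = T^{-1}\sum_t y_t$.

```python
import random


def exploration_prob(t, c_eps=1.0, gamma=0.5):
    """eps_t = c_eps * t**(-gamma), clipped to [0, 1] (clipping is an assumption)."""
    return min(1.0, c_eps * t ** (-gamma))


def simulate(T, arm_means=(0.0, 1.0), c_eps=1.0, gamma=0.5, seed=0):
    """Toy two-armed epsilon-greedy run without covariates.

    Returns (number of exploration rounds, sample-mean reward V_hat).
    """
    rng = random.Random(seed)
    counts = [0, 0]      # pulls per arm
    sums = [0.0, 0.0]    # cumulative reward per arm
    n_explore = 0
    total_reward = 0.0
    for t in range(1, T + 1):
        if rng.random() < exploration_prob(t, c_eps, gamma):
            arm = rng.randrange(2)  # explore: uniform over the two arms
            n_explore += 1
        else:
            # exploit: arm with the larger running mean (0.0 if never pulled)
            est = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(2)]
            arm = 0 if est[0] >= est[1] else 1
        reward = arm_means[arm] + rng.gauss(0.0, 1.0)
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return n_explore, total_reward / T


if __name__ == "__main__":
    for gamma in (0.25, 0.5, 0.75):
        n_explore, v_hat = simulate(20000, gamma=gamma)
        print(gamma, n_explore, round(v_hat, 3))
```

In this toy setting a smaller $\gamma$ produces more exploration rounds, which costs reward but supplies more randomized observations of each arm; a larger $\gamma$ does the opposite.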

Abstract

This paper investigates regret minimization, statistical inference, and their interplay in high-dimensional online decision-making based on the sparse linear context bandit model. We integrate the $\varepsilon$-greedy bandit algorithm for decision-making with a hard thresholding algorithm for estimating sparse bandit parameters and introduce an inference framework based on a debiasing method using inverse propensity weighting. Under a margin condition, our method achieves either $O(T^{1/2})$ regret or classical $O(T^{1/2})$-consistent inference, indicating an unavoidable trade-off between exploration and exploitation. If a diverse covariate condition holds, we demonstrate that a pure-greedy bandit algorithm, i.e., exploration-free, combined with a debiased estimator based on average weighting can simultaneously achieve optimal $O(\log T)$ regret and $O(T^{1/2})$-consistent inference. We also show that a simple sample mean estimator can provide valid inference for the optimal policy's value. Numerical simulations and experiments on Warfarin dosing data validate the effectiveness of our methods.
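The two building blocks named in the abstract, hard thresholding and inverse propensity weighting, can be sketched generically. The code below is an illustration under assumptions, not the authors' procedure: it takes a single squared-loss gradient step for one arm, reweights it by the inverse of the probability of pulling that arm, and keeps the `s` largest-magnitude coordinates. The function names and arguments (`hard_threshold`, `ipw_ht_step`, `prob`, `eta`) are hypothetical.

```python
def hard_threshold(vec, s):
    """Keep the s largest-magnitude entries of vec and zero out the rest."""
    if s <= 0:
        return [0.0] * len(vec)
    order = sorted(range(len(vec)), key=lambda j: abs(vec[j]), reverse=True)
    keep = set(order[:s])
    return [v if j in keep else 0.0 for j, v in enumerate(vec)]


def ipw_ht_step(beta, x, y, chosen, arm, prob, eta, s):
    """One inverse-propensity-weighted gradient step for `arm`, then hard thresholding.

    beta   : current estimate for `arm` (list of floats)
    x, y   : observed covariate vector and reward
    chosen : arm actually pulled in this round
    prob   : probability with which `arm` was pulled (assumed known and > 0)
    eta, s : step size and sparsity level
    """
    # The weight is 1/prob when `arm` was pulled and 0 otherwise, so the
    # weighted gradient is unbiased for the full-information gradient.
    weight = (1.0 / prob) if chosen == arm else 0.0
    residual = sum(b * xi for b, xi in zip(beta, x)) - y
    grad = [weight * residual * xi for xi in x]
    updated = [b - eta * g for b, g in zip(beta, grad)]
    return hard_threshold(updated, s)


if __name__ == "__main__":
    beta = ipw_ht_step([0.0, 0.0, 0.0], x=[1.0, 2.0, 0.0], y=1.0,
                       chosen=1, arm=1, prob=0.5, eta=0.1, s=1)
    print(beta)
```

Rounds in which a different arm was pulled leave the estimate unchanged in this sketch, because their weight is zero.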

Paper Structure

This paper contains 45 sections, 24 theorems, 197 equations, 6 figures, 1 algorithm.

Key Result

Theorem 1

Suppose Assumption assump:basic holds. Let $\varepsilon_t:=c_{\varepsilon}t^{-\gamma}$ for some constant $c_{\varepsilon}$ and $\gamma\in [0,1)$, and let the step size be $\eta:=(4\kappa_{*} \phi_{\max}(s))^{-1}$ in Algorithm alg:onlineHT, where $\kappa_{*}$ is the condition number $\kappa_{*}:=\phi_{\max}(s)/\dots$
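As a side calculation that is not taken from the paper, the expected number of exploration rounds under the schedule $\varepsilon_t=c_{\varepsilon}t^{-\gamma}$ can be bounded by an integral comparison, assuming $\varepsilon_t\le 1$ for every $t$. The result is consistent with the $T^{1-\gamma}$ exploration term quoted in the TL;DR:

$$
\sum_{t=1}^{T}\varepsilon_t
  = c_{\varepsilon}\sum_{t=1}^{T} t^{-\gamma}
  \le c_{\varepsilon}\Bigl(1+\int_{1}^{T} x^{-\gamma}\,dx\Bigr)
  = c_{\varepsilon}\,\frac{T^{1-\gamma}-\gamma}{1-\gamma}
  \le \frac{c_{\varepsilon}}{1-\gamma}\,T^{1-\gamma},
  \qquad \gamma\in[0,1).
$$

For $\gamma=0$ the bound reduces to $c_{\varepsilon}T$, that is, a constant fraction of all rounds is spent exploring.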

Figures (6)

  • Figure 1: Point and interval estimators of the first ten entries in $\beta_1$ (top) and $\beta_0$ (bottom) under scenario (1). The red points indicate the true values.
  • Figure 2: Point and interval estimators of the first ten entries in $\beta_1$ (top) and $\beta_0$ (bottom) under scenario (2). The red points indicate the true values.
  • Figure 3: Optimal value estimation by our online estimator ("Online") under scenario (1) (left panel) and scenario (2) (right panel). The shaded region is the constructed 95% confidence interval. "Oracle" denotes the average of the true optimal values over the collected data points.
  • Figure 4: Comparison of the point and interval estimators of the significant variables in $\beta_0$(top), $\beta_1$(middle) and $\beta_2$(bottom).
  • Figure 5: Comparison of the point and interval estimators of the significant variables in $\beta_0-\beta_1$, $\beta_0-\beta_2$ and $\beta_1-\beta_2$ under $\varepsilon_t=5t^{-1/3}$(top) and $\varepsilon_t=0$(bottom).
  • ...and 1 more figure

Theorems & Definitions (50)

  • Remark 1
  • Theorem 1
  • Theorem 2
  • Remark 2
  • Theorem 3
  • Theorem 4
  • Corollary 1
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • ...and 40 more