Regret Minimization and Statistical Inference in Online Decision Making with High-dimensional Covariates
Congyuan Duan, Wanteng Ma, Jiashuo Jiang, Dong Xia
TL;DR
This work addresses the challenge of simultaneous regret minimization and statistical inference in online decision making with high-dimensional covariates by combining an online hard thresholding (HT) estimator with an $\varepsilon$-greedy policy and debiasing via inverse propensity weighting. It reveals a fundamental trade-off between exploration-induced regret and inference efficiency, governed by the exploration decay rate $\gamma$ and the margin parameter $\nu$, with regret scaling as $R_T\lesssim R_{\max}T^{1-\gamma}+\dots$ and inference variance scaling as $O(T^{-(1-\gamma)})$. Under a covariate-diversity condition, the authors propose an exploration-free scheme using average weighting (AW) that achieves both $O(\log T)$ regret and $\sqrt{T}$-consistent inference, and they extend the framework to inference on the optimal policy's value $V^*$, estimated via $\hat V_T = T^{-1}\sum_{t} y_t$. Numerical experiments on simulations and Warfarin dosing data validate accurate, calibrated inference for parameters and the optimal value, and illustrate practical gains in high-dimensional personalized decision making. Overall, the paper provides a principled framework for balancing learning and uncertainty quantification in online high-dimensional settings, with concrete guidance for kernelized online debiasing and covariate-diversity-driven exploration strategies.
Abstract
This paper investigates regret minimization, statistical inference, and their interplay in high-dimensional online decision-making based on the sparse linear context bandit model. We integrate the $\varepsilon$-greedy bandit algorithm for decision-making with a hard thresholding algorithm for estimating sparse bandit parameters and introduce an inference framework based on a debiasing method using inverse propensity weighting. Under a margin condition, our method achieves either $O(T^{1/2})$ regret or classical $O(T^{1/2})$-consistent inference, indicating an unavoidable trade-off between exploration and exploitation. If a diverse covariate condition holds, we demonstrate that a pure-greedy bandit algorithm, i.e., exploration-free, combined with a debiased estimator based on average weighting can simultaneously achieve optimal $O(\log T)$ regret and $O(T^{1/2})$-consistent inference. We also show that a simple sample mean estimator can provide valid inference for the optimal policy's value. Numerical simulations and experiments on Warfarin dosing data validate the effectiveness of our methods.
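The three ingredients named in the abstract ($\varepsilon$-greedy action selection, hard thresholding, and inverse propensity weighting) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the exploration schedule $\varepsilon_t = \min(1, t^{-\gamma})$, the squared-loss gradient, and all function names (`hard_threshold`, `epsilon_greedy_step`, `ipw_gradient`) are assumptions made here for illustration only.

```python
import numpy as np


def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and set the rest to zero."""
    out = np.zeros_like(v)
    if s > 0:
        keep = np.argsort(np.abs(v))[-s:]
        out[keep] = v[keep]
    return out


def epsilon_greedy_step(x, betas, t, gamma, rng):
    """Pick an arm for context x at round t >= 1.

    Assumed schedule (not taken from the paper): eps_t = min(1, t**(-gamma)).
    Returns (arm, propensity), where propensity is the probability with
    which the returned arm was selected.
    """
    n_arms = len(betas)
    eps = min(1.0, t ** (-gamma))
    greedy = int(np.argmax([x @ b for b in betas]))
    # Spread eps uniformly over all arms; the greedy arm gets the remainder.
    probs = np.full(n_arms, eps / n_arms)
    probs[greedy] += 1.0 - eps
    arm = int(rng.choice(n_arms, p=probs))
    return arm, float(probs[arm])


def ipw_gradient(x, y, beta, propensity):
    """Squared-loss gradient for the pulled arm, reweighted by 1 / propensity."""
    return (x @ beta - y) * x / propensity
```

In this sketch, one round would call `epsilon_greedy_step`, observe the reward `y`, take a gradient step on the pulled arm's parameter using `ipw_gradient`, and then apply `hard_threshold` to restore sparsity. Dividing by the propensity is what keeps the gradient unbiased even though arms are sampled with unequal, data-dependent probabilities.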
