Regret Minimization and Statistical Inference in Online Decision Making with High-dimensional Covariates

Congyuan Duan; Wanteng Ma; Jiashuo Jiang; Dong Xia

TL;DR

This work addresses simultaneous regret minimization and statistical inference in online decision making with high-dimensional covariates. It combines an online hard-thresholding (HT) estimator with an $\varepsilon$-greedy policy and debiases the resulting estimates via inverse propensity weighting. The analysis reveals a fundamental trade-off between exploration-induced regret and inference efficiency, governed by the exploration decay rate $\gamma$ and the margin parameter $\nu$, with regret scaling as $R_T\lesssim R_{\max}T^{1-\gamma}+\dots$ and inference variance scaling as $O(T^{-(1-\gamma)})$. Under a covariate-diversity condition, the authors propose an exploration-free scheme using average weighting (AW) that achieves both $O(\log T)$ regret and $\sqrt{T}$-consistent inference, and they extend the framework to inference on the optimal policy's value $V^*$, estimated via $\hat V_T = T^{-1}\sum_{t} y_t$. Numerical experiments on simulations and Warfarin dosing data validate accurate, calibrated inference for the parameters and the optimal value, and illustrate practical gains in high-dimensional personalized decision making. Overall, the paper provides a principled framework for balancing learning and uncertainty quantification in online high-dimensional settings, with concrete guidance for kernelized online debiasing and covariate-diversity-driven exploration strategies.
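As a worked reading of the two rates quoted above, the display below plugs in the two extreme choices of $\gamma$. It keeps only the leading exploration term $R_{\max}T^{1-\gamma}$ of the regret bound, which is an assumption of this sketch, since the remaining terms are elided in the summary and may matter for some values of $\nu$.

$$
\begin{aligned}
\gamma = 0 &: \quad R_T \lesssim R_{\max}\,T, && \text{variance } O(T^{-1}) \ \ (\sqrt{T}\text{-consistent inference}),\\
\gamma = \tfrac{1}{2} &: \quad R_T \lesssim R_{\max}\,T^{1/2}, && \text{variance } O(T^{-1/2}) \ \ (T^{1/4}\text{-rate inference}).
\end{aligned}
$$

Read this way, faster decay of exploration (larger $\gamma$) lowers the regret term but inflates the variance of the debiased estimator, which is the trade-off the summary describes.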

Abstract

This paper investigates regret minimization, statistical inference, and their interplay in high-dimensional online decision-making based on the sparse linear contextual bandit model. We integrate the $\varepsilon$-greedy bandit algorithm for decision-making with a hard thresholding algorithm for estimating sparse bandit parameters and introduce an inference framework based on a debiasing method using inverse propensity weighting. Under a margin condition, our method achieves either $O(T^{1/2})$ regret or classical $O(T^{1/2})$-consistent inference, indicating an unavoidable trade-off between exploration and exploitation. If a diverse covariate condition holds, we demonstrate that a pure-greedy bandit algorithm, i.e., exploration-free, combined with a debiased estimator based on average weighting can simultaneously achieve optimal $O(\log T)$ regret and $O(T^{1/2})$-consistent inference. We also show that a simple sample mean estimator can provide valid inference for the optimal policy's value. Numerical simulations and experiments on Warfarin dosing data validate the effectiveness of our methods.
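The building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the two-armed setup, the schedule $\varepsilon_t = \min(1, t^{-\gamma})$, and all function names (`hard_threshold`, `exploration_rate`, `epsilon_greedy`, `ipw_weight`) are assumptions made here for illustration. The sketch shows hard thresholding to a fixed sparsity level, an $\varepsilon$-greedy draw that records the propensity of the chosen arm, and the inverse propensity weight that a debiasing step would use.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def hard_threshold(beta, s):
    """Keep the s largest-magnitude entries of beta and zero out the rest."""
    order = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)
    keep = set(order[:s])
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]


def exploration_rate(t, gamma):
    """Assumed decay schedule: epsilon_t = min(1, t^(-gamma)) for t >= 1."""
    return min(1.0, t ** (-gamma))


def epsilon_greedy(x, beta0, beta1, eps, rng):
    """Two-armed epsilon-greedy draw.

    With probability eps the arm is chosen uniformly at random, so the
    greedy arm has probability 1 - eps/2 and the other arm eps/2.
    Returns the chosen arm and the probability with which it was chosen.
    """
    greedy = 1 if dot(x, beta1) > dot(x, beta0) else 0
    p_greedy = 1.0 - eps / 2.0
    action = greedy if rng.random() < p_greedy else 1 - greedy
    propensity = p_greedy if action == greedy else eps / 2.0
    return action, propensity


def ipw_weight(action, arm, propensity):
    """Inverse propensity weight 1{action == arm} / propensity for one round."""
    return 1.0 / propensity if action == arm else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    x = [1.0, 0.0, -0.5]                              # hypothetical covariate
    beta0 = hard_threshold([0.2, 0.01, 0.0], s=1)     # hypothetical estimates
    beta1 = hard_threshold([0.9, -0.02, 0.3], s=2)
    eps = exploration_rate(t=10, gamma=0.5)
    action, prob = epsilon_greedy(x, beta0, beta1, eps, rng)
    print(action, prob, ipw_weight(action, arm=1, propensity=prob))
```

Recording the propensity at decision time is the design point worth noting: the weight $1/\pi_t$ is only available to a later debiasing step if the policy's selection probability was stored alongside each observation.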
