Online Policy Learning and Inference by Matrix Completion
Congyuan Duan, Jingyang Li, Dong Xia
TL;DR
This paper develops a covariate-free online policy learning framework by casting decision-making as a matrix completion bandit (MCB) with low-rank reward matrices. It combines an $\varepsilon$-greedy policy with an online gradient descent learner under a two-phase exploration schedule, achieving regret $R_T=\widetilde{O}(T^{2/3})$ while supporting online policy inference through IPW-based debiasing with established asymptotic normality. The paper provides algorithmic guarantees, derives sharp entrywise error bounds, and applies the methods to SFpark pricing and supermarket discount data, reporting improved performance and statistically valid inference. It connects online learning, high-dimensional matrix completion, and online statistical inference in covariate-free settings, and its results clarify the trade-offs among learning accuracy, regret, and inference efficiency in large-scale, adaptive decision problems.
Abstract
Is it possible to make online decisions when personalized covariates are unavailable? We take a collaborative-filtering approach for decision-making based on collective preferences. By assuming low-dimensional latent features, we formulate the covariate-free decision-making problem as a matrix completion bandit. We propose a policy learning procedure that combines an $\varepsilon$-greedy policy for decision-making with an online gradient descent algorithm for bandit parameter estimation. Our novel two-phase design balances policy learning accuracy and regret performance. For policy inference, we develop an online debiasing method based on inverse propensity weighting and establish its asymptotic normality. Our methods are applied to data from the San Francisco parking pricing project, revealing intriguing discoveries and outperforming the benchmark policy.
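To make the ingredients named in the abstract concrete, the following is a minimal toy sketch, not the authors' algorithm. It simulates a two-arm matrix completion bandit, picks arms with an $\varepsilon$-greedy rule, updates low-rank factors by online gradient descent, and keeps an inverse-propensity-weighted running estimate for one target entry. Everything specific here is an assumption made for illustration: the function name `run_mcb_sketch`, the uniform sampling of entries, the schedule $\varepsilon_t=\max(0.1, t^{-1/3})$, the step size, the initialization, and the single-phase loop (the paper's two-phase design and any further steps in its debiasing procedure are not reproduced).

```python
import random


def run_mcb_sketch(d1=6, d2=5, rank=2, horizon=3000, step=0.05, noise=0.1, seed=0):
    """Toy two-arm matrix completion bandit (illustrative assumptions throughout)."""
    rng = random.Random(seed)

    def rand_factor(rows, scale):
        return [[rng.uniform(-scale, scale) for _ in range(rank)] for _ in range(rows)]

    def entry(U, V, i, j):
        # (i, j) entry of the low-rank matrix U V^T
        return sum(U[i][k] * V[j][k] for k in range(rank))

    # Hypothetical ground truth: one low-rank reward matrix per arm.
    truth = [(rand_factor(d1, 1.0), rand_factor(d2, 1.0)) for _ in range(2)]
    # Learner's factors, started from small random values (an assumption).
    est = [(rand_factor(d1, 0.3), rand_factor(d2, 0.3)) for _ in range(2)]

    ti, tj, ta = 0, 0, 1   # target (row, column, arm) for inference
    ipw_sum = 0.0
    pulls = [0, 0]

    for t in range(1, horizon + 1):
        i, j = rng.randrange(d1), rng.randrange(d2)   # observed entry, uniform
        eps = max(0.1, t ** (-1.0 / 3.0))             # assumed exploration schedule
        preds = [entry(est[a][0], est[a][1], i, j) for a in range(2)]
        greedy = 0 if preds[0] >= preds[1] else 1

        # Epsilon-greedy: greedy arm w.p. 1 - eps/2, the other arm w.p. eps/2.
        arm = greedy if rng.random() < 1.0 - eps / 2.0 else 1 - greedy
        prop = 1.0 - eps / 2.0 if arm == greedy else eps / 2.0  # propensity of chosen arm
        pulls[arm] += 1
        reward = entry(truth[arm][0], truth[arm][1], i, j) + rng.gauss(0.0, noise)

        # IPW debiasing for the target entry, using the estimate from before
        # this round's update. The factor d1*d2/prop undoes the probability of
        # observing exactly (ti, tj) under arm ta.
        plug_in = entry(est[ta][0], est[ta][1], ti, tj)
        correction = 0.0
        if (i, j, arm) == (ti, tj, ta):
            correction = d1 * d2 / prop * (reward - plug_in)
        ipw_sum += plug_in + correction

        # Online gradient step on 0.5 * (prediction - reward)^2 for the pulled arm.
        U, V = est[arm]
        resid = preds[arm] - reward
        u_old = U[i][:]
        for k in range(rank):
            U[i][k] -= step * resid * V[j][k]
            V[j][k] -= step * resid * u_old[k]

    return {
        "pulls": pulls,
        "plug_in": entry(est[ta][0], est[ta][1], ti, tj),
        "debiased": ipw_sum / horizon,
        "truth": entry(truth[ta][0], truth[ta][1], ti, tj),
    }


if __name__ == "__main__":
    print(run_mcb_sketch())
```

Running the script prints the pull counts together with the plug-in estimate, the IPW-averaged estimate, and the simulated true value for the target entry. In this sketch each round's debiased term is conditionally unbiased for the target entry, but its variance grows as the propensity of the target arm shrinks, which gives a feel for the tension between low regret (little exploration) and efficient inference that the abstract refers to.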
