Online Policy Learning and Inference by Matrix Completion

Congyuan Duan; Jingyang Li; Dong Xia

TL;DR

This paper develops a covariate-free framework for online policy learning by casting decision-making as a matrix completion bandit (MCB) with low-rank reward matrices. It combines an $\varepsilon$-greedy policy with online gradient descent under a two-phase exploration schedule, attaining regret $R_T=\widetilde{O}(T^{2/3})$ while supporting online policy inference through inverse propensity weighting (IPW) debiasing, for which asymptotic normality is established. The paper states algorithmic guarantees and entrywise error bounds, and applies the method to SFpark pricing and supermarket discount data, where it reports improved performance and statistically valid inference. The work connects online learning, high-dimensional matrix completion, and online statistical inference in covariate-free settings, and its results illustrate the trade-offs between learning accuracy, regret, and inference efficiency in large-scale adaptive decision problems.

Abstract

Is it possible to make online decisions when personalized covariates are unavailable? We take a collaborative-filtering approach for decision-making based on collective preferences. By assuming low-dimensional latent features, we formulate the covariate-free decision-making problem as a matrix completion bandit. We propose a policy learning procedure that combines an $\varepsilon$-greedy policy for decision-making with an online gradient descent algorithm for bandit parameter estimation. Our novel two-phase design balances policy learning accuracy and regret performance. For policy inference, we develop an online debiasing method based on inverse propensity weighting and establish its asymptotic normality. Our methods are applied to data from the San Francisco parking pricing project, revealing intriguing discoveries and outperforming the benchmark policy.
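The learning loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the rank-1 factorization, the ground-truth matrices, the step size, the choice to weight each gradient step by the inverse propensity, and the names `epsilon_greedy`, `sgd_step`, and `run_bandit` are all assumptions made for illustration. The propensity rule (probability $1-\varepsilon/2$ for the greedy arm, $\varepsilon/2$ for the other) is the standard $\varepsilon$-greedy form and is likewise assumed here.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def epsilon_greedy(est0, est1, eps, rng):
    """Pick an arm and return (action, P(action = 1)).

    Assumed rule: explore uniformly with probability eps, otherwise play the
    arm with the larger estimated reward.
    """
    greedy = 1 if est1 > est0 else 0
    prob_arm1 = 1 - eps / 2 if greedy == 1 else eps / 2
    action = 1 if rng.random() < prob_arm1 else 0
    return action, prob_arm1


def sgd_step(U, V, i, j, reward, weight, lr):
    """One weighted gradient step on the squared loss at entry (i, j)."""
    resid = dot(U[i], V[j]) - reward
    old_ui = U[i][:]  # keep the old row factor for the column update
    U[i] = [u - lr * weight * resid * v for u, v in zip(U[i], V[j])]
    V[j] = [v - lr * weight * resid * u for u, v in zip(old_ui, V[j])]


def run_bandit(T=2000, d1=5, d2=4, eps=0.2, lr=0.02, seed=0):
    """Simulate a two-armed matrix completion bandit (hypothetical setup)."""
    rng = random.Random(seed)
    # Hypothetical rank-1 reward matrices: arm 1 is better on even rows only.
    truth = [
        [[0.5] * d2 for _ in range(d1)],
        [[1.0 if i % 2 == 0 else 0.0] * d2 for i in range(d1)],
    ]
    # One pair of rank-1 factors per arm, with a positive random start.
    U = [[[rng.uniform(0.5, 1.0)] for _ in range(d1)] for _ in range(2)]
    V = [[[rng.uniform(0.5, 1.0)] for _ in range(d2)] for _ in range(2)]
    total_reward = 0.0
    for _ in range(T):
        i, j = rng.randrange(d1), rng.randrange(d2)  # context = matrix entry
        est = [dot(U[a][i], V[a][j]) for a in (0, 1)]
        a, p1 = epsilon_greedy(est[0], est[1], eps, rng)
        reward = truth[a][i][j] + rng.gauss(0.0, 0.1)
        prob_a = p1 if a == 1 else 1 - p1
        # Assumption: the step is weighted by the inverse propensity.
        sgd_step(U[a], V[a], i, j, reward, 1.0 / prob_a, lr)
        total_reward += reward
    return U, V, total_reward
```

Calling `run_bandit()` returns the learned factors for both arms and the cumulative reward. Only the arm that was played is updated in each round, which is why the propensity is tracked alongside the action.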

Paper Structure (45 sections, 22 theorems, 224 equations, 12 figures, 2 tables, 3 algorithms)

Introduction
Main contributions
Related works
Collaborative Filtering and Matrix Completion Bandits
Policy Learning and Regret Performance
$\varepsilon$-greedy bandit algorithm with stochastic gradient descent
Regret performance
Policy Inference by Inverse Propensity Weighting
Debiasing
Asymptotic normality
Policy inference
Numerical Experiments
Statistical inference
Regret analysis
Real Data Analysis
...and 30 more sections

Key Result

Theorem 1

Suppose that the horizon $T\leq d_1^{100}$ and the initializations are incoherent, satisfying $\|\widehat{M}_{0,0}-M_0\|_{\rm F}+\|\widehat{M}_{1,0}-M_1\|_{\rm F}\leq c_0\lambda_{\min}$ for some small constant $c_0>0$. Fix some $\gamma\in [0,1)$, $\varepsilon\in(0, 1)$, and set $T_0:=C_0T^{1-\gamma}$ for some large constants $C_1, C_2>0$ depending on $C_0, c_0, c_1,c_2$ only. There exist constant
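The phase-one length $T_0:=C_0T^{1-\gamma}$ from the theorem statement can be computed as in the sketch below. The function name `phase_one_length`, the default `C0=1.0`, the rounding up to a whole round, and the cap at `T` are assumptions made for illustration; the excerpt above does not fix a value for $C_0$.

```python
import math


def phase_one_length(T, gamma, C0=1.0):
    """Return T_0 = C0 * T**(1 - gamma) as a whole number of rounds.

    C0 is a hypothetical constant. Rounding up and capping at T are
    conveniences of this sketch, not part of the theorem.
    """
    if not 0 <= gamma < 1:
        raise ValueError("gamma must lie in [0, 1)")
    if T <= 0 or C0 <= 0:
        raise ValueError("T and C0 must be positive")
    return min(T, math.ceil(C0 * T ** (1 - gamma)))


# Example: phase-one lengths for the two values of gamma used in the figures.
for gamma in (1 / 3, 1 / 4):
    print(gamma, phase_one_length(60_000, gamma))
```

A larger $\gamma$ gives a shorter first phase for the same horizon, since the exponent $1-\gamma$ shrinks.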

Figures (12)

Figure 1: Box plots of $\mathbbm{1}(a_t=1)/\pi_t\cdot\xi_t\langle \widehat{L}_1\widehat{L}_1^{\top}X_t\widehat{R}_1\widehat{R}_1^{\top},Q\rangle$ (left) and $\mathbbm{1}(a_t=0)/(1-\pi_t)\cdot\xi_t\langle \widehat{L}_0\widehat{L}_0^{\top}X_t\widehat{R}_0\widehat{R}_0^{\top},Q\rangle$ (right), grouped by samples $X_t$ from $\Omega_1$ and $\Omega_0$ in a simulation with $T=60,000$ and $T_0=20,000$. Specifically, arm 1 was assigned to 18,761 samples from $\Omega_1$ and 1,281 from $\Omega_0$, while arm 0 was assigned to 1,285 samples from $\Omega_1$ and 18,673 from $\Omega_0$. This demonstrates that most samples assigned to arm $i$ originate from $\Omega_i$. However, samples from $\Omega_{1-i}$ exhibit significantly higher variance, which primarily determines the variance of $\langle \widehat{M}_i,Q\rangle$.
Figure 2: Empirical distributions under $\gamma=1/3$ (top) and $\gamma=1/4$ (bottom) with $T=60,000$ and $T_0=20,000$. The red curve represents the p.d.f. of standard normal distributions.
Figure 3: Empirical distributions under $\gamma=1/3$ with $T=30,000$, $T_0=8,000$ (top), and $T_0=5,000$ (bottom). The red curve represents the p.d.f. of standard normal distributions.
Figure 4: Average empirical cumulative regret against $T^{2/3}$ for $\gamma=\frac{1}{3}$ (left) and $T^{3/4}$ for $\gamma=\frac{1}{4}$ (right) under 100 trials.
Figure 5: SFpark: Scree plots of eigenvalues against factor numbers for $\widehat{M}_{1,0}$ (left) and $\widehat{M}_{0,0}$ (right).
...and 7 more figures
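The variance pattern described in the Figure 1 caption can be motivated by a short calculation. This is a sketch under stated assumptions rather than the paper's derivation: conditional on the history and $X_t$, the action satisfies $\mathbb{P}(a_t=1)=\pi_t$, the noise $\xi_t$ is independent of $a_t$ with mean zero and variance $\sigma^2$, and the inner-product factor $\langle \widehat{L}_1\widehat{L}_1^{\top}X_t\widehat{R}_1\widehat{R}_1^{\top},Q\rangle$ is treated as fixed and omitted.

$$
\mathbb{E}\!\left[\frac{\mathbbm{1}(a_t=1)}{\pi_t}\,\xi_t\right]=0,
\qquad
\operatorname{Var}\!\left(\frac{\mathbbm{1}(a_t=1)}{\pi_t}\,\xi_t\right)
=\frac{\mathbb{E}[\mathbbm{1}(a_t=1)]}{\pi_t^{2}}\,\sigma^{2}
=\frac{\sigma^{2}}{\pi_t}.
$$

If, as assumed here, an $\varepsilon$-greedy rule assigns the non-greedy arm with probability $\varepsilon/2$, then a sample from $\Omega_0$ that receives arm 1 has $\pi_t=\varepsilon/2$ and contributes variance $2\sigma^2/\varepsilon$, compared with $\sigma^2/(1-\varepsilon/2)$ for a sample from $\Omega_1$. The weighted terms from $\Omega_{1-i}$ are therefore few but highly variable, which agrees with the caption's observation.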

Theorems & Definitions (40)

Theorem 1
Theorem 2
Theorem 3
Theorem 4
Corollary 1
Theorem B.1
Theorem B.2
Theorem B.3
Theorem B.4
Corollary B.1
...and 30 more
