$ε$-Policy Gradient for Online Pricing
Lukasz Szpruch, Tanut Treetanthiploet, Yufei Zhang
TL;DR
The paper tackles online pricing under contextual uncertainty by combining model-based inference of the customer response with model-free policy gradient updates. It models the response distribution $\nu$ with a parametric family $\pi_\theta$, estimates $\theta$ via regularized empirical risk minimization, and updates prices through a gradient step on the estimated expected reward, augmented with $ε$-exploration to ensure learnability. The main results establish high-probability bounds on the parameter estimation error and prove an overall regret bound of order $\mathcal{O}(\sqrt{T})$ up to logarithmic factors, achieved by carefully balancing exploration and gradient-based exploitation. This work advances sample-efficient online pricing by integrating model-based inference into policy gradient methods, offering fast adaptation to changing rewards and scalable learning in high-dimensional context-action spaces.
Abstract
Combining model-based and model-free reinforcement learning approaches, this paper proposes and analyzes an $ε$-policy gradient algorithm for the online pricing learning task. The algorithm extends the $ε$-greedy algorithm by replacing greedy exploitation with a gradient descent step and facilitates learning via model inference. We optimize the regret of the proposed algorithm by quantifying the exploration cost in terms of the exploration probability $ε$ and the exploitation cost in terms of the gradient descent optimization and gradient estimation errors. The algorithm achieves an expected regret of order $\mathcal{O}(\sqrt{T})$ (up to a logarithmic factor) over $T$ trials.
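To make the structure of the loop concrete, the following is a minimal toy sketch, not the paper's algorithm. It assumes a context-free linear demand model $y = \theta_0 + \theta_1 p + \text{noise}$ with reward $p \cdot y$, a ridge-regression estimate of $\theta$ standing in for the regularized empirical risk minimization, and an illustrative exploration schedule $ε_t = \min(1, 1/\sqrt{t})$. The function name, parameter values, step size, and schedule are all hypothetical choices made for illustration. The gradient step is written as ascent on the estimated reward, which is equivalent to descent on its negative.

```python
import numpy as np

def epsilon_policy_gradient(T=2000, theta_true=(10.0, -2.0), noise_std=0.5,
                            p_min=0.0, p_max=5.0, step=0.05, ridge=0.1, seed=0):
    """Toy epsilon-policy-gradient loop for a linear demand model (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta0, theta1 = theta_true          # unknown to the learner
    A = ridge * np.eye(2)                # regularized Gram matrix
    b = np.zeros(2)                      # accumulated feature-response products
    theta_hat = np.zeros(2)              # current estimate of (theta0, theta1)
    p_iter = p_min                       # price iterate updated by gradient steps
    total_reward = 0.0

    for t in range(1, T + 1):
        eps_t = min(1.0, 1.0 / np.sqrt(t))   # assumed exploration schedule
        if rng.random() < eps_t:
            # Exploration: draw a price uniformly so theta stays identifiable.
            price = rng.uniform(p_min, p_max)
        else:
            # Exploitation: one projected gradient step on the estimated
            # expected reward r_hat(p) = p * (theta_hat[0] + theta_hat[1] * p).
            grad = theta_hat[0] + 2.0 * theta_hat[1] * p_iter
            p_iter = float(np.clip(p_iter + step * grad, p_min, p_max))
            price = p_iter

        # Observe the noisy customer response and collect the reward.
        demand = theta0 + theta1 * price + noise_std * rng.normal()
        total_reward += price * demand

        # Model inference: ridge-regression update of theta_hat.
        x = np.array([1.0, price])
        A += np.outer(x, x)
        b += demand * x
        theta_hat = np.linalg.solve(A, b)

    return p_iter, theta_hat, total_reward
```

In this toy model the reward-maximizing price is $-\theta_0 / (2\theta_1)$, so the returned iterate can be compared against that value. Replacing the gradient step with a full maximization of the estimated reward would give the $ε$-greedy baseline that the abstract says the algorithm extends.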
