$ε$-Policy Gradient for Online Pricing

Lukasz Szpruch; Tanut Treetanthiploet; Yufei Zhang

TL;DR

The paper tackles online pricing under contextual uncertainty by combining model-based inference of the customer response with model-free policy gradient updates. It models the response distribution $\nu$ with a parametric family $\pi_\theta$, estimates $\theta$ via regularized empirical risk minimization, and updates prices through a gradient step on the estimated expected reward, augmented with $ε$-exploration to ensure learnability. The main results establish high-probability bounds on the parameter estimation error and prove a regret bound of order $\mathcal{O}(\sqrt{T})$ up to logarithmic factors, achieved by balancing the cost of exploration against the errors of gradient-based exploitation. By integrating model-based inference into policy gradient methods, the approach aims at sample-efficient online pricing, fast adaptation to changing rewards, and scalable learning in high-dimensional context-action spaces.
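The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's algorithm: the linear demand model, the feature map `features`, the ridge (regularized least-squares) estimator, the linear pricing policy `w`, the schedule `eps = t ** -0.5`, and all constants are hypothetical choices made here for concreteness.

```python
import numpy as np

def features(x, a):
    # Hypothetical feature map: intercept, context x, price a.
    return np.array([1.0, x, a])

def run_eps_policy_gradient(T=2000, seed=0, lr=0.05, reg=1.0):
    rng = np.random.default_rng(seed)
    # Assumed ground truth: mean demand = 2 + x - a, reward = a * demand.
    theta_true = np.array([2.0, 1.0, -1.0])
    a_min, a_max = 0.0, 3.0
    V = reg * np.eye(3)          # regularized Gram matrix
    b = np.zeros(3)              # running sum of y * phi
    w = np.array([0.5, 0.0])     # policy: price = w[0] + w[1] * x
    theta_hat = np.zeros(3)
    for t in range(1, T + 1):
        x = rng.uniform(0.0, 1.0)
        eps = min(1.0, t ** -0.5)            # assumed exploration schedule
        if rng.random() < eps:
            a = rng.uniform(a_min, a_max)    # explore: random price
        else:
            a = float(np.clip(w[0] + w[1] * x, a_min, a_max))  # exploit
        phi = features(x, a)
        y = phi @ theta_true + 0.1 * rng.standard_normal()     # noisy demand
        # Model-based step: regularized least squares (ridge) for theta.
        V += np.outer(phi, phi)
        b += y * phi
        theta_hat = np.linalg.solve(V, b)
        # Policy gradient step on the estimated reward
        # a * (theta_hat[0] + theta_hat[1] * x + theta_hat[2] * a).
        a_pol = w[0] + w[1] * x
        dr_da = theta_hat[0] + theta_hat[1] * x + 2.0 * theta_hat[2] * a_pol
        w += lr * dr_da * np.array([1.0, x])  # chain rule: da/dw = (1, x)
    return w, theta_hat

if __name__ == "__main__":
    w, theta_hat = run_eps_policy_gradient()
    print("policy weights:", w)
    print("estimated theta:", theta_hat)
```

In this toy model the reward $a(2 + x - a)$ is maximized at $a^\star(x) = 1 + x/2$, so the policy weights should drift toward roughly $(1, 0.5)$. Without the exploration branch, the exploited prices become a linear function of the context, the features turn collinear, and the price sensitivity can no longer be identified, which is the role the TL;DR assigns to $ε$-exploration.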

Abstract

Combining model-based and model-free reinforcement learning approaches, this paper proposes and analyzes an $ε$-policy gradient algorithm for the online pricing learning task. The algorithm extends the $ε$-greedy algorithm by replacing greedy exploitation with a gradient descent step and facilitates learning via model inference. We optimize the regret of the proposed algorithm by quantifying the exploration cost in terms of the exploration probability $ε$ and the exploitation cost in terms of the gradient descent optimization and gradient estimation errors. The algorithm achieves an expected regret of order $\mathcal{O}(\sqrt{T})$ (up to a logarithmic factor) over $T$ trials.
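To see why a $\sqrt{T}$ rate is plausible on the exploration side, consider a heuristic calculation. It is a sketch rather than the paper's proof: the schedule $ε_t = t^{-1/2}$ and the constant $C$ bounding the regret of a single exploration round are assumptions introduced here.

```latex
\sum_{t=1}^{T} C\,\varepsilon_t
  = C \sum_{t=1}^{T} t^{-1/2}
  \le C \Bigl( 1 + \int_{1}^{T} s^{-1/2}\,\mathrm{d}s \Bigr)
  = C \bigl( 2\sqrt{T} - 1 \bigr)
  \le 2C\sqrt{T}.
```

Under these assumptions the expected cost of exploration over $T$ trials is of order $\sqrt{T}$, so the overall rate is preserved only if the exploitation errors (optimization and gradient estimation) accumulate at the same order or slower.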

Paper Structure (13 sections, 8 theorems, 60 equations, 1 algorithm)

Problem formulation
Main results
Estimation of response distribution
$\epsilon$-policy gradient algorithm and its regret
Proofs of main results
Proof of Theorem \ref{thm:statistical error}
Proof of Theorem \ref{Thm: convergence_epsilon}
Proof of Theorem \ref{Thm: expected_regret}
Proofs of technical results
Proofs of Example \ref{example:negative log loss} and Proposition \ref{prop:glm_explore}
Proof of Example \ref{ex:ls}
Proof of Lemma \ref{lemma:sub-exponential condition}
Proof of Lemma \ref{lemma:regret to action}

Key Result

Theorem 2.1

Suppose (H.assum:pi_theta) and (H.assum:loss) hold. For all $T\in {\mathbb{N}}$ and all $\delta >0$, where $V_T \coloneqq \sum_{t=1}^T H(x_t, a_t) + 2 I_{d}$.

Theorems & Definitions (21)

Example 2.1
Remark 2.1
Example 2.2
Remark 2.2
Theorem 2.1
Proposition 2.2
Theorem 2.3
Theorem 2.4
Lemma 3.1
proof
...and 11 more
