Regret of exploratory policy improvement and $q$-learning

Wenpin Tang, Xun Yu Zhou

TL;DR

Under suitable conditions on the growth and regularity of the model parameters, this work provides a quantitative error and regret analysis of both the exploratory policy improvement algorithm and the $q$-learning algorithm.

Abstract

We study the convergence of $q$-learning and related algorithms introduced by Jia and Zhou (J. Mach. Learn. Res., 24 (2023), 161) for controlled diffusion processes. Under suitable conditions on the growth and regularity of the model parameters, we provide a quantitative error and regret analysis of both the exploratory policy improvement algorithm and the $q$-learning algorithm.

Paper Structure

This paper contains 13 sections, 13 theorems, 101 equations, 1 figure.

Key Result

Theorem 3.2

Let Assumption assump hold, and fix $\eta \in (0,1)$. There exist $L, C > 0$ (independent of $\gamma$ and $n$) such that $\dots$, where $\theta(\gamma):=\beta + (1+\eta^{-1}) L^2\left( 1+ e^{\frac{L}{\gamma}} + \frac{1}{\gamma} e^{\frac{L}{\gamma}} \right)^2$.
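To get a feel for how the constant $\theta(\gamma)$ in the theorem depends on $\gamma$, the formula can be evaluated numerically. The sketch below is only an illustration: the function name `theta` and the sample values of $\beta$, $L$ and $\eta$ are hypothetical, since this summary does not give them, and $\beta$ is treated simply as a real parameter.

```python
import math


def theta(gamma: float, beta: float, L: float, eta: float) -> float:
    """Evaluate theta(gamma) = beta + (1 + 1/eta) * L^2 * (1 + e^{L/gamma} + e^{L/gamma}/gamma)^2.

    The parameter values passed in are illustrative assumptions; the theorem
    only asserts that suitable constants L, C > 0 exist.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if not 0 < eta < 1:
        raise ValueError("eta must lie in (0, 1)")
    growth = math.exp(L / gamma)          # e^{L/gamma}
    bracket = 1.0 + growth + growth / gamma
    return beta + (1.0 + 1.0 / eta) * L**2 * bracket**2


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the trend in gamma.
    for gamma in (2.0, 1.0, 0.5, 0.25):
        print(gamma, theta(gamma, beta=0.1, L=1.0, eta=0.5))
```

Because of the $e^{L/\gamma}$ terms, $\theta(\gamma)$ grows very quickly as $\gamma \to 0$ and decreases as $\gamma$ increases.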

Figures (1)

  • Figure 1: Plot of $\phi \to -\frac{\phi}{2}\left(\frac{1}{\phi^2} - \frac{e^\phi}{(e^\phi-1)^2} \right)$.
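The curve in Figure 1 can be reproduced numerically from its formula. The following is a minimal sketch: the function name `fig1_curve` is hypothetical, and the value at $\phi = 0$ is filled in by the limit (the function behaves like $-\phi/24$ near the origin), which is an assumption made here by continuity rather than something stated in the caption.

```python
import math


def fig1_curve(phi: float) -> float:
    """Evaluate phi -> -(phi/2) * (1/phi^2 - e^phi / (e^phi - 1)^2).

    math.expm1 computes e^phi - 1 accurately for small phi.
    The value at phi = 0 is set to the limit 0 (assumption by continuity).
    """
    if phi == 0.0:
        return 0.0
    denom = math.expm1(phi) ** 2          # (e^phi - 1)^2
    return -0.5 * phi * (1.0 / phi**2 - math.exp(phi) / denom)


if __name__ == "__main__":
    # Sample the curve on a small grid; the values can be passed to any plotting tool.
    for phi in (-4.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 4.0):
        print(phi, fig1_curve(phi))
```

Since $e^\phi/(e^\phi-1)^2 = 1/(4\sinh^2(\phi/2))$ is even in $\phi$, the plotted function is odd, so it suffices to inspect $\phi > 0$.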

Theorems & Definitions (23)

  • Theorem 3.2
  • Lemma 3.3
  • Lemma 3.4
  • proof
  • Lemma 3.5
  • proof
  • Lemma 3.6
  • proof
  • proof: Proof of Theorem \ref{thm:cvrate}
  • Theorem 4.2
  • ...and 13 more