q-Learning in Continuous Time

Yanwei Jia; Xun Yu Zhou

TL;DR

This work develops a theory of q-learning in continuous time for entropy-regularized diffusion RL. It introduces the little q-function $q(t,x,a;\pi)=\partial_t J(t,x;\pi) + H\left( t,x,a,\partial_x J(t,x;\pi),\partial^2_x J(t,x;\pi)\right) - \beta J(t,x;\pi)$, the first-order term of the conventional Q-function with respect to the time step, and links it to the Hamiltonian $H$. A martingale-based framework characterizes the q-function and the value function jointly in both on-policy and off-policy settings, and yields q-learning algorithms that either recover a continuous-time SARSA-like TD update or align with policy-gradient methods via Gibbs updates. The theory extends to ergodic tasks and is demonstrated on mean–variance portfolio selection and ergodic LQ control, including offline/off-policy settings. Overall, the time-discretization-free q-learning paradigm clarifies the role of the Hamiltonian in policy improvement and provides practical actor–critic algorithms with convergence guarantees when possible.

Abstract

We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term "(little) q-function". This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a "q-learning" theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes, in both on-policy and off-policy settings. We then apply the theory to devise different actor-critic algorithms for solving underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2022b). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized conventional Q-learning algorithms.
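The Gibbs measure mentioned above can be illustrated numerically. The sketch below is not taken from the paper: it assumes a hypothetical scalar action and a toy q-function `toy_q` that is quadratic in the action, and the names `gibbs_density`, `toy_q`, `mean`, and `curvature` are illustrative only. It normalizes $\exp\{q/\gamma\}$ on a grid and checks that the resulting density has the mean and variance of the Gaussian one would expect for a quadratic q-function, which is the case where the normalizing constant is available in closed form.

```python
import numpy as np


def gibbs_density(q_values, actions, gamma):
    """Normalize exp(q / gamma) into a density on a uniform 1-D action grid."""
    logits = q_values / gamma
    logits = logits - logits.max()      # subtract the max for numerical stability
    weights = np.exp(logits)
    da = actions[1] - actions[0]        # uniform grid spacing
    return weights / (weights.sum() * da)


def toy_q(a, mean=0.5, curvature=2.0):
    """Hypothetical q-function, concave quadratic in the action (an assumption)."""
    return -0.5 * curvature * (a - mean) ** 2


gamma = 0.1                              # temperature (entropy weight), assumed value
actions = np.linspace(-3.0, 4.0, 7001)   # action grid, assumed wide enough
density = gibbs_density(toy_q(actions), actions, gamma)

da = actions[1] - actions[0]
gibbs_mean = (actions * density).sum() * da
gibbs_var = ((actions - gibbs_mean) ** 2 * density).sum() * da

# For a quadratic q the Gibbs measure is Gaussian with
# mean = `mean` and variance = gamma / curvature.
print(gibbs_mean, gibbs_var)
```

When the q-function is not of such a tractable form, the integral in the denominator generally has no closed form, which is the situation the abstract describes as the density of the Gibbs measure not being explicitly computable.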

Paper Structure (21 sections, 15 theorems, 129 equations, 4 figures, 3 tables, 5 algorithms)

Introduction
Problem Formulation and Preliminaries
Classical model-based formulation
Exploratory formulation in reinforcement learning
Some useful preliminary results
q-Function in Continuous Time: The Theory
Q-function
q-function
Optimal q-function
q-Learning Algorithms When Normalizing Constant Is Available
q-Learning algorithms
Connections with SARSA
q-Learning Algorithms When Normalizing Constant Is Unavailable
A stronger policy improvement theorem
Connections with policy gradient
...and 6 more sections

Key Result

Theorem 2

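For any given $\bm{\pi}\in \bm{\Pi}$, define $\bm\pi'(\cdot|t,x)\propto \exp\{ \frac{1}{\gamma}H( t,x,\cdot,\frac{\partial J}{\partial x}(t,x;\bm\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\bm\pi) ) \}$. If $\bm\pi'\in \bm\Pi$, then $J(t,x;\bm\pi') \geq J(t,x;\bm\pi)$. Moreover, if the map $\bm\pi \mapsto \bm\pi'$, that is, $\mathcal{I}(\bm\pi) = \frac{\exp\{ \frac{1}{\gamma}H( t,x,\cdot,\frac{\partial J}{\partial x}(t,x;\bm\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\bm\pi) ) \}}{\int_{\mathcal{A}}\exp\{ \frac{1}{\gamma}H( t,x,a,\frac{\partial J}{\partial x}(t,x;\bm\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\bm\pi) ) \}\mathrm{d}a}$ for $\bm\pi\in\bm\Pi$, has a fixed point $\bm\pi^*$, then $\bm\pi^*$ is the optimal policy.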
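As an illustration of the Gibbs update in Theorem 2, consider a scalar action and suppose, purely as an assumption for this example, that the Hamiltonian is a concave quadratic in the action. The coefficients $h_0, h_1, h_2$ below are hypothetical placeholders that may depend on $(t,x)$ and on the derivatives of $J$; they are not notation from the paper. Completing the square shows that the updated policy is Gaussian:

```latex
% Assumed form (illustrative only): H is concave quadratic in the action a.
H\Big(t,x,a,\tfrac{\partial J}{\partial x},\tfrac{\partial^2 J}{\partial x^2}\Big)
  = -\tfrac{1}{2} h_2\, a^2 + h_1\, a + h_0, \qquad h_2 > 0 .

% Complete the square in a:
-\tfrac{1}{2} h_2\, a^2 + h_1\, a
  = -\tfrac{1}{2} h_2 \Big(a - \tfrac{h_1}{h_2}\Big)^2 + \tfrac{h_1^2}{2 h_2} .

% Terms that do not depend on a are absorbed by the normalization, so
\bm\pi'(a \mid t,x) \;\propto\;
  \exp\Big\{ -\tfrac{h_2}{2\gamma} \Big(a - \tfrac{h_1}{h_2}\Big)^2 \Big\}
\quad\Longrightarrow\quad
\bm\pi'(\cdot \mid t,x) = \mathcal{N}\Big( \tfrac{h_1}{h_2},\; \tfrac{\gamma}{h_2} \Big) .
```

Under this assumption the mean of the updated policy is the maximizer of the Hamiltonian in $a$, and the variance scales linearly with the temperature $\gamma$, so a smaller $\gamma$ concentrates the policy around that maximizer.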

Figures (4)

Figure 1: Running average rewards of three RL algorithms. A single state trajectory is generated with length $T = 10^6$ and discretized at $\Delta t=0.1$, to which three online algorithms are applied: "Policy Gradient" described in Algorithm 3 in Jia and Zhou (2022b), "Q-Learning" described in Appendix B, and "q-Learning" described in the paper's algorithm labeled "ergodic incremental". We repeat the experiment 100 times for each method and plot the average reward over time, with the shaded area indicating the standard deviation. The two dashed horizontal lines are, respectively, the omniscient optimal average reward without exploration when the model parameters are known, and the omniscient optimal average reward less the exploration cost.
Figure 2: Paths of learned parameters of three RL algorithms. A single state trajectory is generated with length $T = 10^6$ and discretized at $\Delta t=0.1$, to which three online algorithms are applied: "Policy Gradient" described in Algorithm 3 in Jia and Zhou (2022b), "Q-Learning" described in Appendix B, and "q-Learning" described in the paper's algorithm labeled "ergodic incremental". All the policies are restricted to the parametric form $\bm\pi^{\psi}(\cdot|x) = \mathcal{N}(\psi_1 x + \psi_2, e^{\psi_3})$. The omniscient optimal policy is $\psi_1^*\approx-0.354$, $\psi_2^*\approx-0.708$, $e^{\psi_3^*}\approx 0.035$, shown as dashed lines. We repeat the experiment 100 times for each method and plot the standard deviation of the learned parameters as the shaded area. The width of each shaded area is twice the corresponding standard deviation.
Figure 3: Running average rewards of three RL algorithms with different time discretization sizes. A single state trajectory is generated with length $T = 10^5$ and discretized at different step sizes: $\Delta t=1$ in (a), $\Delta t=0.1$ in (b), and $\Delta t=0.01$ in (c). For each step size, we apply three online algorithms: "Policy Gradient" described in Algorithm 3 in Jia and Zhou (2022b), "Q-Learning" described in Appendix B, and "q-Learning" described in the paper's algorithm labeled "ergodic incremental". We repeat the experiment 100 times for each method and plot the average reward over time, with the shaded area indicating the standard deviation.
Figure 4: Paths of learned parameters of three RL algorithms with different time discretization sizes. A single state trajectory is generated with length $T = 10^6$ and discretized at different step sizes: $\Delta t=1$ in (a), $\Delta t=0.1$ in (b), and $\Delta t=0.01$ in (c), under the behavior policy $a_t\sim \mathcal{N}(0,1)$. From top to bottom are the paths of learned $\psi_1$, $\psi_2$, and $e^{\psi_3}$, respectively. For each step size, we apply three algorithms: "Policy Gradient" described in Algorithm 3 in Jia and Zhou (2022b), "Q-Learning" described in Appendix B, and "q-Learning" described in the paper's algorithm labeled "ergodic incremental". We repeat the experiment 100 times for each method and plot the average reward over time, with the shaded area indicating the standard deviation.
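The parametric policy form $\mathcal{N}(\psi_1 x + \psi_2, e^{\psi_3})$ used in the captions above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the second argument of $\mathcal{N}$ is a variance, uses the approximate optimal parameters quoted in the Figure 2 caption, and the function name `sample_actions` and the test state `x = 1.0` are hypothetical.

```python
import numpy as np


def sample_actions(x, psi, rng, size):
    """Draw actions from N(psi[0] * x + psi[1], exp(psi[2])).

    Assumption: exp(psi[2]) is the variance of the Gaussian policy.
    """
    mean = psi[0] * x + psi[1]
    std = np.sqrt(np.exp(psi[2]))
    return rng.normal(loc=mean, scale=std, size=size)


# Approximate omniscient optimal parameters quoted in the Figure 2 caption.
psi_star = np.array([-0.354, -0.708, np.log(0.035)])

rng = np.random.default_rng(0)
x = 1.0                                  # hypothetical state value
samples = sample_actions(x, psi_star, rng, size=20000)

# The sample moments should be close to the policy mean
# psi_1 * x + psi_2 and to the variance exp(psi_3).
print(samples.mean(), samples.var())
```

Parametrizing the variance as $e^{\psi_3}$ keeps it positive for any real $\psi_3$, so the parameter can be updated by unconstrained gradient steps.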

Theorems & Definitions (18)

Definition 1
Theorem 2
Proposition 3
Definition 4
Corollary 5
Theorem 6
Theorem 7
Proposition 8
Theorem 9
Example 1
...and 8 more
