q-Learning in Continuous Time
Yanwei Jia, Xun Yu Zhou
TL;DR
This work develops a theory of q-learning in continuous time for entropy-regularized diffusion RL. It introduces the little q-function $q(t,x,a;\pi)=\partial_t J(t,x;\pi) + H\left( t,x,a,\partial_x J(t,x;\pi),\partial^2_x J(t,x;\pi)\right) - \beta J(t,x;\pi)$, where $J$ is the value function of policy $\pi$, $H$ is the Hamiltonian, and $\beta$ is the discount rate; this is the first-order term, in the time step, of the conventional Q-function. A martingale-based framework characterizes the q-function and the value function in both on-policy and off-policy settings, and yields q-learning algorithms that either recover a continuous-time SARSA-like TD update or, through Gibbs policy updates, align with policy-gradient methods. The theory extends to ergodic tasks and is demonstrated on mean–variance portfolio selection and ergodic LQ control, including offline and off-policy settings. Overall, the time-discretization-free q-learning paradigm clarifies the role of the Hamiltonian in policy improvement and provides practical actor–critic algorithms, with convergence guarantees where available.
Abstract
We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term "(little) q-function". This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a "q-learning" theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes, in both on-policy and off-policy settings. We then apply the theory to devise different actor-critic algorithms for solving underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2022b). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized conventional Q-learning algorithms.
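To illustrate what "the density function of the Gibbs measure generated from the q-function" can look like, the following is a minimal sketch rather than the authors' algorithm. It assumes a one-dimensional action, a fixed state and time, and a hypothetical quadratic `toy_q`. The names `gibbs_density`, `toy_q`, `center`, `curvature`, and `temperature` are illustrative assumptions, not taken from the paper. The policy density is taken proportional to `exp(q / temperature)` and normalized numerically on a uniform action grid.

```python
import math


def gibbs_density(q_fn, actions, temperature):
    """Density proportional to exp(q / temperature) on a uniform 1-D action grid."""
    logits = [q_fn(a) / temperature for a in actions]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    step = actions[1] - actions[0]
    total = sum(weights) * step  # Riemann-sum normalizing constant
    return [w / total for w in weights]


def toy_q(a, center=0.5, curvature=2.0):
    """Hypothetical q-function at a fixed (t, x): concave quadratic in the action."""
    return -0.5 * curvature * (a - center) ** 2


# Uniform grid on [-1.5, 2.5], symmetric around the assumed maximizer 0.5.
n = 4001
actions = [-1.5 + 4.0 * i / (n - 1) for i in range(n)]
temperature = 0.1  # assumed entropy-regularization weight

density = gibbs_density(toy_q, actions, temperature)
step = actions[1] - actions[0]

mass = sum(density) * step
mean = sum(a * p for a, p in zip(actions, density)) * step
var = sum((a - mean) ** 2 * p for a, p in zip(actions, density)) * step

# For a quadratic q the Gibbs density is Gaussian, so the mean should sit at
# `center` and the variance should be close to temperature / curvature.
print(mass, mean, var)
```

In this toy case the normalizing constant is available in closed form, since the density is Gaussian. When the q-function is not quadratic in the action, the normalizing integral may not be computable explicitly, which is the distinction the abstract draws between its algorithms.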
