A random measure approach to reinforcement learning in continuous time

Christian Bender; Nguyen Tran Thuan

TL;DR

This work develops a random measure framework for reinforcement learning (RL) in continuous time. Exploration is modeled by executing a randomized policy on a discrete-time grid, and the resulting dynamics are reformulated as stochastic differential equations (SDEs) driven by random measures. The main result is a limit theorem as the grid mesh size tends to zero: the driving random measures converge vaguely, which yields a grid-sampling limit SDE driven by white-noise martingale measures and a Poisson random measure. This limit SDE provides a principled basis for analyzing exploratory control and for deriving learning algorithms. The approach clarifies connections to earlier exploratory SDEs, highlights differences in multi-control convergence, and supports continuous-time TD-style learning schemes grounded in the limit dynamics. Together, the results form a rigorous bridge between measure-valued policy execution and practical learning methods in continuous-time RL with diffusion and jumps.

Abstract

We present a random measure approach for modeling exploration, i.e., the execution of measure-valued controls, in continuous-time reinforcement learning (RL) with controlled diffusion and jumps. First, we consider the case when sampling the randomized control in continuous time takes place on a discrete-time grid and reformulate the resulting stochastic differential equation (SDE) as an equation driven by suitable random measures. The construction of these random measures makes use of the Brownian motion and the Poisson random measure (which are the sources of noise in the original model dynamics) as well as the additional random variables, which are sampled on the grid for the control execution. Then, we prove a limit theorem for these random measures as the mesh-size of the sampling grid goes to zero, which leads to the grid-sampling limit SDE that is jointly driven by white noise random measures and a Poisson random measure. We also argue that the grid-sampling limit SDE can substitute the exploratory SDE and the sample SDE of the recent continuous-time RL literature, i.e., it can be applied for the theoretical analysis of exploratory control problems and for the derivation of learning algorithms.
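The grid-sampling step described in the abstract can be illustrated with a minimal simulation sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-dimensional state, the Gaussian policy in `sample_action`, the drift, diffusion, and jump coefficients, and the Bernoulli approximation of the Poisson jumps. The sketch only shows the mechanism: at each grid point a fresh uniform random variable is drawn, mapped to an action through the policy, and that action is held fixed until the next grid point while the Brownian and jump noise drive the state.

```python
import math
import random
from statistics import NormalDist


def sample_action(x, u):
    """Draw an action from a hypothetical Gaussian policy N(-x, 0.5^2)
    by pushing a uniform variable u through the inverse CDF."""
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf requires 0 < u < 1
    return NormalDist(mu=-x, sigma=0.5).inv_cdf(u)


def simulate_grid_sampling(x0=1.0, T=1.0, n_grid=10, n_sub=20,
                           jump_rate=2.0, seed=0):
    """Euler-type simulation of a controlled jump diffusion in which the
    randomized control is re-sampled only on a grid of mesh size T / n_grid."""
    rng = random.Random(seed)
    h = T / n_grid          # mesh size of the sampling grid
    dt = h / n_sub          # finer step used for the state dynamics
    x, path, actions = x0, [x0], []
    for i in range(n_grid):
        # Sample the control once per grid interval and hold it fixed.
        a = sample_action(x, rng.random())
        actions.append(a)
        for k in range(n_sub):
            dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment
            # Bernoulli approximation of a Poisson jump on [t, t + dt).
            jump = 1 if rng.random() < jump_rate * dt else 0
            # Illustrative coefficients: drift a, diffusion 0.2 * (1 + |a|),
            # jump size 0.1 * a. All three depend on the held action.
            x += a * dt + 0.2 * (1.0 + abs(a)) * dw + 0.1 * a * jump
            path.append(x)
    return path, actions


if __name__ == "__main__":
    path, actions = simulate_grid_sampling()
    print(len(path), len(actions))
```

Increasing `n_grid` while keeping `T` fixed refines the sampling grid, which is the regime the limit theorem addresses; the sketch itself makes no claim about the form of the limit.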

Paper Structure (25 sections, 12 theorems, 164 equations)

Introduction
Motivation and discussion of the main result
Controlled SDEs with randomized policies
Random measure interpretation of grid-sampling
Limit theorem and grid-sampling limit SDE
Comparison to the exploratory SDE of WZZ20
Outlook: Towards learning
Proof of \ref{thm:limit}
Preliminaries
Proof of assertion (\ref{converge-prob-error})
Proof of assertion (\ref{eq:weak-convergence-rho})
Background on martingale measures and proofs for Subsection \ref{sec:discrete_sample_random_measure}
Background on martingale measures
Proofs for \ref{sec:discrete_sample_random_measure}
Proof of \ref{lem:Bm_Pi}
...and 10 more sections

Key Result

Lemma 2.4

A measurable random field $Y \colon \Omega\times [0,T]\times [0,1]^d\rightarrow \mathbb R$ is integrable with respect to $M_D^\Pi$ if and only if [...]. In this case, a.s., [...]

Theorems & Definitions (29)

Remark 2.1
Remark 2.2
Remark 2.3
Lemma 2.4
Lemma 2.5
Lemma 2.6
Theorem 2.7
Remark 2.8
Remark 2.9
Remark 2.10
...and 19 more
