Continuous-time reinforcement learning: ellipticity enables model-free value function approximation

Wenlong Mou

TL;DR

The paper tackles off-policy continuous-time RL with function approximation for controlled diffusion processes, where Bellman backups can be unstable under standard norms. By exploiting uniform ellipticity, it introduces a Sobolev-proximal scheme that decouples value and advantage learning: a Sobolev-regularized update for the value function and a least-squares refinement for the advantage, both defined through projected Bellman fixed points. The author establishes non-asymptotic oracle inequalities in which the estimation error is bounded by the best approximation error in Sobolev-based function classes, the localized complexities of those classes, an exponentially decaying optimization error, and a discretization error that scales as $\sqrt{\eta}$; concrete rates are given for parametric and nonparametric function classes. The framework shows that, under ellipticity, model-free RL with function approximation can achieve guarantees on par with supervised learning, offering a principled path toward scalable, stable continuous-time RL in diffusion settings.
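To give a feel for what a Sobolev-regularized least-squares update can look like, here is a minimal sketch in one dimension. It is not the paper's algorithm: the cosine basis, the penalty weight `lam`, and the function names `features` and `sobolev_ridge` are all assumptions made for illustration. The sketch fits a linear-in-features model by penalizing the squared derivative of the fit (an $H^1$ seminorm) in addition to the usual squared error.

```python
import numpy as np

def features(x, num_freqs=5):
    """Cosine features on [0, 1] and their derivatives (hypothetical basis)."""
    k = np.arange(num_freqs)
    phi = np.cos(np.pi * np.outer(x, k))                 # shape (n, num_freqs)
    dphi = -np.pi * k * np.sin(np.pi * np.outer(x, k))   # d/dx of each feature
    return phi, dphi

def sobolev_ridge(x, y, lam=1e-2, num_freqs=5):
    """Minimize mean((phi w - y)^2) + lam * mean((dphi w)^2) over w."""
    phi, dphi = features(x, num_freqs)
    n = len(x)
    gram = phi.T @ phi / n      # empirical L2 Gram matrix
    stiff = dphi.T @ dphi / n   # empirical Gram matrix of the derivatives
    # A tiny jitter keeps the linear solve well-posed (assumption, not from the paper).
    lhs = gram + lam * stiff + 1e-10 * np.eye(num_freqs)
    return np.linalg.solve(lhs, phi.T @ y / n)

# Synthetic regression data: the target is the k = 1 cosine feature plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=500)
y = np.cos(np.pi * x) + 0.1 * rng.standard_normal(500)
w = sobolev_ridge(x, y)
```

Increasing `lam` shrinks the derivative of the fitted function, trading some bias for smoothness; the coefficient on the $k=1$ feature is pulled slightly below 1 for that reason.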

Abstract

We study off-policy reinforcement learning for controlling continuous-time Markov diffusion processes with discrete-time observations and actions. We consider model-free algorithms with function approximation that learn value and advantage functions directly from data, without unrealistic structural assumptions on the dynamics. Leveraging the ellipticity of the diffusions, we establish a new class of Hilbert-space positive definiteness and boundedness properties for the Bellman operators. Based on these properties, we propose the Sobolev-prox fitted $q$-learning algorithm, which learns value and advantage functions by iteratively solving least-squares regression problems. We derive oracle inequalities for the estimation error, governed by (i) the best approximation error of the function classes, (ii) their localized complexity, (iii) exponentially decaying optimization error, and (iv) numerical discretization error. These results identify ellipticity as a key structural property that renders reinforcement learning with function approximation for Markov diffusions no harder than supervised learning.
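The observation setting in the abstract (a continuous-time diffusion, with states observed and actions chosen only at discrete times) can be sketched with an Euler-Maruyama simulation. This is an illustrative sketch under stated assumptions, not the paper's setup: the state is scalar, the noise level `sigma` is a positive constant (so the diffusion is uniformly elliptic), `eta` is assumed to be the spacing between observations, and the drift and policy passed in below are hypothetical.

```python
import numpy as np

def simulate(x0, policy, drift, sigma, eta, num_steps, substeps=10, seed=0):
    """Simulate dX = drift(X, a) dt + sigma dW with actions held fixed over each
    observation interval of length eta (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    dt = eta / substeps          # finer grid approximating continuous time
    x = float(x0)
    states, actions = [x], []
    for _step in range(num_steps):
        a = policy(x)            # action chosen from the observed state
        for _sub in range(substeps):
            noise = rng.standard_normal()
            x = x + drift(x, a) * dt + sigma * np.sqrt(dt) * noise
        states.append(x)         # state is recorded only every eta time units
        actions.append(a)
    return np.array(states), np.array(actions)

# Hypothetical linear drift and linear feedback policy, chosen only for illustration.
states, actions = simulate(
    x0=1.0,
    policy=lambda x: -0.5 * x,
    drift=lambda x, a: -x + a,
    sigma=0.3,
    eta=0.1,
    num_steps=200,
)
```

The returned arrays form the kind of discretely observed trajectory (state, action, next state) that a model-free method would learn from; the learner never sees the sub-interval path or the drift function itself.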

Paper Structure (57 sections, 25 theorems, 266 equations)

Introduction
Contributions:
Notation:
Related work
Continuous-time RL:
RL with function approximation:
Learning for elliptic PDEs:
Problem setup
Control protocol and MDP formulation
Value functions and Bellman equation
Observation model
From projected fixed points to the algorithm
Population-level projected fixed-point equation
Projected fixed-point equations:
Population-level iterates
...and 42 more sections

Key Result

Proposition 1

Under suitable regularity conditions on the drift function $b$, the diffusion matrix function $\Lambda$, and the policy $\pi$, there exists a constant $c_{\mathrm{discr}} > 0$, depending on these regularity parameters, such that for any test function $g \in C_{\mathrm{lin}}^{4}$, we have … (Theorem 4.1 of jia2025accuracy)

Theorems & Definitions (25)

Proposition 1: Theorem 4.1 of jia2025accuracy
Theorem 1
Theorem 2
Theorem 3
Lemma 1
Lemma 2
Lemma 3
Lemma 4
Lemma 5
Lemma 6
...and 15 more
