Operator Models for Continuous-Time Offline Reinforcement Learning

Nicolas Hoischen, Petar Bevanda, Max Beier, Stefan Sosnowski, Boris Houska, Sandra Hirche

TL;DR

This work addresses offline reinforcement learning in continuous-time systems by formulating an operator-based framework that learns an infinitesimal generator in an RKHS and solves a Hamilton–Jacobi–Bellman equation via a simple dynamic-programming recursion. The proposed O-CTRL algorithm separates world-model learning from policy optimization, provides explicit operator-norm error bounds, and delivers end-to-end convergence guarantees that relate data size, smoothness, and stability to suboptimality. Numerical experiments on linear and nonlinear SDEs and Pendulum-Gym illustrate the method’s learning rates and competitive performance with offline baselines, highlighting the practicality of operator theory for continuous-time offline RL. Overall, the paper demonstrates that operator-based approaches can yield rigorous, scalable guarantees for offline decision-making in complex, continuous-time environments.

Abstract

Continuous-time stochastic processes underlie many natural and engineered systems. In healthcare, autonomous driving, and industrial control, direct interaction with the environment is often unsafe or impractical, motivating offline reinforcement learning from historical data. However, there is limited statistical understanding of the approximation errors inherent in learning policies from offline datasets. We address this by linking reinforcement learning to the Hamilton-Jacobi-Bellman equation and proposing an operator-theoretic algorithm based on a simple dynamic programming recursion. Specifically, we represent our world model in terms of the infinitesimal generator of controlled diffusion processes learned in a reproducing kernel Hilbert space. By integrating statistical learning methods and operator theory, we establish global convergence of the value function and derive finite-sample guarantees with bounds tied to system properties such as smoothness and stability. Our theoretical and numerical results indicate that operator-based approaches may hold promise in solving offline reinforcement learning using continuous-time optimal control.
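
For orientation, the two objects named in the abstract can be written in their standard textbook form. This is a sketch in generic notation, not the paper's own: the drift $b$, diffusion coefficient $\sigma$, running reward $r$, and discount rate $\rho > 0$ are assumed symbols, and the paper's cost convention and regularity assumptions may differ. For a controlled diffusion $dX_t = b(X_t,u_t)\,dt + \sigma(X_t,u_t)\,dW_t$, the infinitesimal generator acts on a smooth test function $f$ as

$$(\mathcal{L}^{u} f)(x) = b(x,u)^{\top}\nabla f(x) + \tfrac{1}{2}\operatorname{tr}\!\big(\sigma(x,u)\,\sigma(x,u)^{\top}\,\nabla^{2} f(x)\big),$$

and the discounted infinite-horizon Hamilton-Jacobi-Bellman equation for the optimal value function $V$ reads

$$\rho\,V(x) = \sup_{u}\,\big\{\, r(x,u) + (\mathcal{L}^{u} V)(x) \,\big\}.$$

Under this reading, learning the generator from data supplies the only unknown term on the right-hand side, which is why the world model and the policy optimization can be treated separately.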

Paper Structure

This paper contains 53 sections, 13 theorems, 89 equations, 6 figures, 5 tables.

Key Result

proposition 1

Let $\widehat{r}_\bx=\widehat{S}_{\mathsf{S}}^{*}\,\bm{r}$ and $\widehat{\mathcal{D}}_\bu = \widehat{S}_{\mathsf{S}}^{*}\bm{D}_\bu(\cdot)$, where $\widehat{S}_{\mathsf{S}}^{*}:\mathbb{R}^{N}\to\spIN$ is the adjoint of the sampling operator $\widehat{S}_{\mathsf{S}}$. Let the transition dynamics be described [...] with [...], where $\ES_{\mathsf{S}}\featx(\bx)=\bm{k}_{\mathsf S}(\bx)$ is the sampled canonical map, [...]
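
The two sampled objects in this statement can be illustrated with a minimal sketch. It assumes a Gaussian kernel and omits any normalization constant, neither of which is specified in the excerpt; the function names are hypothetical and not taken from the paper. The adjoint of a sampling operator maps a weight vector in $\mathbb{R}^{N}$ to the kernel expansion $x \mapsto \sum_i w_i\,k(x_i, x)$, and the sampled canonical map returns the vector of kernel evaluations $\bm{k}_{\mathsf S}(x)$.

```python
import numpy as np

def gaussian_kernel(X, Y, lengthscale=1.0):
    """Gram matrix k(x_i, y_j) for a Gaussian kernel (kernel choice is an assumption)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))

def sampled_canonical_map(X_samples, x, lengthscale=1.0):
    """k_S(x): kernel evaluations between the N sampled states and a query state x."""
    return gaussian_kernel(X_samples, np.atleast_2d(x), lengthscale)[:, 0]

def adjoint_sampling(weights, X_samples, lengthscale=1.0):
    """Return the function x -> sum_i weights[i] * k(x_i, x) (normalization omitted)."""
    def f(x):
        return float(weights @ sampled_canonical_map(X_samples, x, lengthscale))
    return f

# Hypothetical data: 5 sampled states in R^2 with observed rewards r.
rng = np.random.default_rng(0)
X_samples = rng.normal(size=(5, 2))
r = rng.normal(size=5)

# Reward embedded in the RKHS as a kernel expansion over the samples.
r_hat = adjoint_sampling(r, X_samples)
```

Here `r_hat` plays the role of $\widehat{r}_\bx$: it is a function that can be evaluated at any state, not only at the sampled ones.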

Figures (6)

  • Figure 1: Overview of the $\textsf{O-CTRL}$ algorithm: a generator world model based on an RKHS representation of state–action data enables dynamic programming for optimal value functions, illustrated on the swing-up pendulum task from Gymnasium (Towers et al., 2024).
  • Figure 2: Value function learning. Left: A linear SDE with quadratic costs. Right: Convergence for a nonlinear SDE with quadratic costs. The error convergence confirms our worst-case analysis.
  • Figure 3: Error decomposition diagram for bounding the term $\norm{\Vstar - \EV_{k}}_{L_{{\mu}}^2}$
  • Figure 4: Comparison of learned ($200$ datapoints) and reference (ground truth) value function and policy for the linear SDE with additive action, demonstrating that we can effectively learn unknown value functions and policies without parametric assumptions.
  • Figure 5: Comparison of learned ($200$ datapoints) and reference (ground truth as $\epsilon$ goes to zero) value function for the nonlinear SDE with affine action, demonstrating that we can effectively learn unknown value functions and policies without parametric assumptions.
  • ...and 1 more figure

Theorems & Definitions (30)

  • proposition 1
  • corollary 1
  • theorem 1
  • proof : Proof sketch.
  • corollary 2
  • proposition 2: Fenchel Lipschitzness
  • proof
  • remark 1: Smooth Approximations
  • remark 2: Explicit Fenchel conjugate
  • remark 3
  • ...and 20 more