Decoupled Continuous-Time Reinforcement Learning via Hamiltonian Flow

Minh Nguyen

TL;DR

A novel decoupled continuous-time actor-critic algorithm with alternating updates that outperforms prior continuous-time and leading discrete-time baselines across challenging continuous-control benchmarks and a real-world trading task, achieving 21% profit over a single quarter, nearly doubling the second-best method.

Abstract

Many real-world control problems, ranging from finance to robotics, evolve in continuous time with non-uniform, event-driven decisions. Standard discrete-time reinforcement learning (RL), based on fixed-step Bellman updates, struggles in this setting: as time gaps shrink, the $Q$-function collapses to the value function $V$, eliminating action ranking. Existing continuous-time methods reintroduce action information via an advantage-rate function $q$. However, they enforce optimality through complicated martingale losses or orthogonality constraints, which are sensitive to the choice of test processes. These approaches entangle $V$ and $q$ into a large, complex optimization problem that is difficult to train reliably. To address these limitations, we propose a novel decoupled continuous-time actor-critic algorithm with alternating updates: $q$ is learned from diffusion generators on $V$, and $V$ is updated via a Hamiltonian-based value flow that remains informative under infinitesimal time steps, where standard max/softmax backups fail. Theoretically, we prove rigorous convergence via new probabilistic arguments, sidestepping the challenge that generator-based Hamiltonians lack Bellman-style contraction under the sup-norm. Empirically, our method outperforms prior continuous-time and leading discrete-time baselines across challenging continuous-control benchmarks and a real-world trading task, achieving 21% profit over a single quarter, nearly doubling the second-best method.
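The collapse of $Q$ onto $V$ described in the abstract can be illustrated numerically. The sketch below is not the paper's algorithm. It assumes a hypothetical one-dimensional deterministic system $\dot{x} = a$, a made-up reward rate, a fixed quadratic $V$, and a discount rate `BETA`, none of which come from the paper. With no diffusion term, the generator applied to $V$ reduces to $V'(x)\,a$, so the advantage rate is taken here as $q(x,a) = r(x,a) + V'(x)\,a - \beta V(x)$. The code shows that the gap between actions in the discrete-time $Q$ shrinks proportionally to the step size, while $(Q - V)/\Delta t$ stays informative and approaches $q$.

```python
import numpy as np

BETA = 0.1  # discount rate (assumed for illustration)


def reward_rate(x, a):
    """Hypothetical running reward: penalize state and control effort."""
    return -x**2 - a**2


def value(x):
    """A fixed, hand-picked value function (not learned, not from the paper)."""
    return -2.0 * x**2


def value_grad(x):
    """Derivative of the hand-picked value function."""
    return -4.0 * x


def q_discrete(x, a, dt):
    """One-step discrete-time Q for the toy dynamics dx/dt = a with step dt."""
    return reward_rate(x, a) * dt + np.exp(-BETA * dt) * value(x + a * dt)


def advantage_rate(x, a):
    """Continuous-time advantage rate: reward + generator of V - discount."""
    return reward_rate(x, a) + value_grad(x) * a - BETA * value(x)


x0 = 1.0
actions = np.array([-1.0, 0.0, 1.0])

gaps = {}
for dt in (1.0, 0.1, 0.001):
    q_values = q_discrete(x0, actions, dt)
    # Spread of Q across actions: this is what a max backup relies on.
    gaps[dt] = q_values.max() - q_values.min()
    # Rescaled difference: stays O(1) and approaches the advantage rate.
    rescaled = (q_values - value(x0)) / dt
    print(f"dt={dt:<6} action gap={gaps[dt]:.5f} (Q-V)/dt={np.round(rescaled, 3)}")

print("advantage rate q:", advantage_rate(x0, actions))
```

In this toy setting the action gap scales with the step size, so any fixed level of estimation noise would eventually swamp the ranking in $Q$, whereas the rescaled quantity keeps the same ordering of actions at every step size.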

Paper Structure (82 sections, 40 theorems, 265 equations, 11 figures, 9 tables, 6 algorithms)

Introduction
Preliminaries
Discrete-time MDPs to continuous-time control
Stochastic control formulation
Generators and Hamiltonians
Instantaneous $q$-function in continuous time
Formulation
Challenges in continuous-time RL
Continuous-time actor--critic via Hamiltonian flow
Step 1: Updating $q$ from a fixed $V$.
Step 2: Updating $V$ from $q$ is not a discrete-time max step.
Final algorithm: a single critic via $Q \approx V+q$
Theoretical analysis
Proof sketch.
Proof sketch.
...and 67 more sections

Key Result

Theorem 4.1

Fix $\alpha\ge 0$ and $\tau>0$. Under the assumptions on the dynamics, the reward, and the test functions, let $V_k$ be the value function after $k$ iterations of $V_{k+1} := T_\tau^{(\alpha)}(V_k)$. Then $V_k$ converges to $V^{(\alpha)}$ as $\tau \to 0$, and for all $k\ge 0$ a bound holds whose constants $C_0$ and $C_1$ are independent of $k$.

Figures (11)

Figure 1: Evaluation returns for 4 continuous-time algorithms including our CT-SAC and CT-TD3 on control tasks over 12 seeds.
Figure 2: Evaluation returns for our CT-SAC against 4 discrete-time algorithms on control tasks over 12 seeds.
Figure 3: Two-week evaluation returns for CT-SAC and CT-TD3 (Ours) against 6 other algorithms on trading tasks.
Figure 4: Evaluation returns for CT-SAC against SAC and its reward-shaping version on control tasks over 12 seeds.
Figure 5: Evaluation returns for CT-SAC against SAC and its reward-shaping version on trading task over 12 seeds.
...and 6 more figures

Theorems & Definitions (74)

Theorem 4.1: Value Function Convergence
Theorem 4.2: Convergence with $q$-function
Theorem 4.3: Theoretical Algorithm Convergence
Corollary 4.3: Algorithm Convergence
Lemma 3.4: Dynkin's formula for controlled diffusions
proof
Lemma 3.5: Small-time moment bounds
proof: Proof sketch
Lemma 3.6: Entropy-regularized expectation via a KL shift
proof
...and 64 more
