Decoupled Continuous-Time Reinforcement Learning via Hamiltonian Flow

Minh Nguyen

TL;DR

A novel decoupled continuous-time actor-critic algorithm with alternating updates that outperforms prior continuous-time and leading discrete-time baselines across challenging continuous-control benchmarks and a real-world trading task, achieving 21% profit over a single quarter, nearly doubling the second-best method.

Abstract

Many real-world control problems, ranging from finance to robotics, evolve in continuous time with non-uniform, event-driven decisions. Standard discrete-time reinforcement learning (RL), based on fixed-step Bellman updates, struggles in this setting: as time gaps shrink, the $Q$-function collapses to the value function $V$, eliminating action ranking. Existing continuous-time methods reintroduce action information via an advantage-rate function $q$. However, they enforce optimality through complicated martingale losses or orthogonality constraints, which are sensitive to the choice of test processes. These approaches entangle $V$ and $q$ into a large, complex optimization problem that is difficult to train reliably. To address these limitations, we propose a novel decoupled continuous-time actor-critic algorithm with alternating updates: $q$ is learned from diffusion generators on $V$, and $V$ is updated via a Hamiltonian-based value flow that remains informative under infinitesimal time steps, where standard max/softmax backups fail. Theoretically, we prove rigorous convergence via new probabilistic arguments, sidestepping the challenge that generator-based Hamiltonians lack Bellman-style contraction under the sup-norm. Empirically, our method outperforms prior continuous-time and leading discrete-time baselines across challenging continuous-control benchmarks and a real-world trading task, achieving 21% profit over a single quarter, nearly doubling the second-best method.
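The collapse of $Q$ onto $V$ described in the abstract can be illustrated with a small numerical sketch. Everything here is an assumption chosen for illustration and is not taken from the paper: a deterministic one-dimensional system $dx = a\,dt$ with no diffusion term, a hypothetical stand-in value function `V(x) = -x^2`, a reward rate `reward_rate`, a discount rate `BETA`, and three candidate actions. Under these assumptions the one-step $Q_{\Delta t}(x,a) = r(x,a)\,\Delta t + e^{-\beta \Delta t} V(x + a\,\Delta t)$ differs across actions only at order $\Delta t$, while the rate $r(x,a) + V'(x)\,a - \beta V(x)$ keeps an order-one gap between actions.

```python
import math

BETA = 0.5                    # discount rate (assumed for this toy example)
ACTIONS = (-1.0, 0.0, 1.0)    # hypothetical finite action set


def V(x):
    """Hypothetical stand-in value function (not an optimal value)."""
    return -x * x


def reward_rate(x, a):
    """Assumed running reward per unit time."""
    return -x * x - 0.1 * a * a


def q_discrete(x, a, dt):
    """One-step Q with step dt under the toy dynamics dx = a dt."""
    x_next = x + a * dt
    return reward_rate(x, a) * dt + math.exp(-BETA * dt) * V(x_next)


def advantage_rate(x, a):
    """First-order rate: r + V'(x) * a - beta * V(x), with V'(x) = -2x."""
    dV = -2.0 * x
    return reward_rate(x, a) + dV * a - BETA * V(x)


def action_gap(x, dt):
    """Spread of the one-step Q across actions; shrinks with dt."""
    qs = [q_discrete(x, a, dt) for a in ACTIONS]
    return max(qs) - min(qs)


def rate_gap(x):
    """Spread of the advantage rate across actions; independent of dt."""
    rates = [advantage_rate(x, a) for a in ACTIONS]
    return max(rates) - min(rates)


if __name__ == "__main__":
    x = 1.0
    for dt in (1e-1, 1e-2, 1e-3, 1e-4):
        print(f"dt={dt:g}  Q gap={action_gap(x, dt):.6f}  rate gap={rate_gap(x):.6f}")
```

In this toy setting the $Q$ gap vanishes linearly in the step size, so any fixed amount of approximation noise eventually swamps the action ranking, whereas the rate gap does not depend on the step size. The paper's $q$ is built from diffusion generators applied to $V$; the drift-only expression above is only a simplified analogue.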

Paper Structure (82 sections, 40 theorems, 265 equations, 11 figures, 9 tables, 6 algorithms)

Key Result

Theorem 4.1

Fix $\alpha\ge 0$ and $\tau>0$. Under the assumptions on the dynamics, reward, and test functions, let $V_k$ be the value function after $k$ iterations of $V_{k+1} := T_\tau^{(\alpha)}(V_k)$. Then $V_k$ converges to $V^{(\alpha)}$ as $\tau \to 0$, and for all $k\ge 0$, Here $C_0$ and $C_1$ are constants independent of $k$.

Figures (11)

  • Figure 1: Evaluation returns for 4 continuous-time algorithms including our CT-SAC and CT-TD3 on control tasks over 12 seeds.
  • Figure 2: Evaluation returns for our CT-SAC against 4 discrete-time algorithms on control tasks over 12 seeds.
  • Figure 3: Two-week evaluation returns for CT-SAC and CT-TD3 (Ours) against 6 other algorithms on trading tasks.
  • Figure 4: Evaluation returns for CT-SAC against SAC and its reward-shaping version on control tasks over 12 seeds.
  • Figure 5: Evaluation returns for CT-SAC against SAC and its reward-shaping version on trading task over 12 seeds.
  • ...and 6 more figures

Theorems & Definitions (74)

  • Theorem 4.1: Value Function Convergence
  • Theorem 4.2: Convergence with $q$-function
  • Theorem 4.3: Theoretical Algorithm Convergence
  • Corollary 4.3: Algorithm Convergence
  • Lemma 3.4: Dynkin's formula for controlled diffusions
  • proof
  • Lemma 3.5: Small-time moment bounds
  • proof: Proof sketch
  • Lemma 3.6: Entropy-regularized expectation via a KL shift
  • proof
  • ...and 64 more