A Pontryagin Method of Model-based Reinforcement Learning via Hamiltonian Actor-Critic

Chengyang Gu, Yuxin Pan, Hui Xiong, Yize Chen

Abstract

Model-based reinforcement learning (MBRL) improves sample efficiency by leveraging learned dynamics models for policy optimization. However, the effectiveness of methods such as actor-critic is often limited by compounding model errors, which degrade long-horizon value estimation. Existing approaches, such as Model-Based Value Expansion (MVE), partially mitigate this issue through multi-step rollouts, but remain sensitive to rollout horizon selection and residual model bias. Motivated by the Pontryagin Maximum Principle (PMP), we propose Hamiltonian Actor-Critic (HAC), a model-based approach that eliminates explicit value function learning by directly optimizing a Hamiltonian defined over the learned dynamics and reward for deterministic systems. By avoiding value approximation, HAC reduces sensitivity to model errors while admitting convergence guarantees. Extensive experiments on continuous control benchmarks, in both online and offline RL settings, demonstrate that HAC outperforms model-free and MVE-based baselines in control performance, convergence speed, and robustness to distributional shift, including out-of-distribution (OOD) scenarios. In offline settings with limited data, HAC matches or exceeds state-of-the-art methods, highlighting its strong sample efficiency.
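To make the abstract's central construction concrete, the following is a minimal sketch of the discrete-time Pontryagin Hamiltonian it refers to, assuming deterministic learned dynamics $s_{t+1} = f(s_t, a_t)$, per-step cost $c(s_t, a_t)$, and costate $\lambda_{t+1}$ (notation borrowed from Lemma D.1 below; the exact objective used in the paper may differ):

$$H(s_t, a_t, \lambda_{t+1}) = c(s_t, a_t) + \lambda_{t+1}^{\top} f(s_t, a_t).$$

Under this reading, the costate $\lambda_{t+1}$ plays the role normally filled by a learned critic, and the actor is updated by descending $\nabla_{a_t} H$ rather than a bootstrapped value estimate.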

Paper Structure

This paper contains 40 sections, 5 theorems, 71 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Lemma D.1

In the actor network learning objective (Eq. eq:optimization), updating $a_t$ with $a_t \leftarrow a_t - \eta \nabla_{a_t} H(s_t, a_t, \lambda_{t+1})$ is equivalent to taking a gradient descent step on $J = \sum_{t=0}^{T-1} c(s_t, a_t) + \Phi_T(s_T)$ with respect to $a_t$: $a_t \leftarrow a_t - \eta \nabla_{a_t} J$.
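The lemma's equivalence follows from the standard discrete-time adjoint (costate) construction; the sketch below uses the textbook recursion and is an assumed reconstruction, not a quotation of the paper's proof. With dynamics $s_{t+1} = f(s_t, a_t)$ and costates defined backward from the terminal cost,

$$\lambda_T = \nabla_{s_T} \Phi_T(s_T), \qquad \lambda_t = \nabla_{s_t} c(s_t, a_t) + \Big(\tfrac{\partial f}{\partial s_t}\Big)^{\top} \lambda_{t+1},$$

the Hamiltonian $H(s_t, a_t, \lambda_{t+1}) = c(s_t, a_t) + \lambda_{t+1}^{\top} f(s_t, a_t)$ satisfies

$$\nabla_{a_t} J = \nabla_{a_t} c(s_t, a_t) + \Big(\tfrac{\partial f}{\partial a_t}\Big)^{\top} \lambda_{t+1} = \nabla_{a_t} H(s_t, a_t, \lambda_{t+1}),$$

so the Hamiltonian gradient step on $a_t$ coincides with a gradient descent step on $J$.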

Figures (4)

  • Figure 2: Learning Curves Comparison (Online RL): We compare our HAC with DDPG, SAC, and MVE-DDPG on Pendulum, MountainCar, Swimmer, and Hopper. For MVE-DDPG and HAC, which require multi-step imaginary rollouts, we set the rollout horizon to $K=10$ for Pendulum to cover the full control horizon, $K=5$ for MountainCar and Swimmer, and $K=3$ for Hopper. All results are averaged over 5 random seeds.
  • Figure 3: Learning Curves Comparison (Offline RL): We compare our HAC with IQL, SAC-Off, and MOPO on LQR, Pendulum, MountainCar, and Swimmer. For MOPO and HAC, which require multi-step imaginary rollouts, the rollout horizon is set to $K=10$ for LQR and Pendulum to cover the full control horizon, and $K=5$ for MountainCar and Swimmer. Results are averaged over 5 random seeds.
  • Figure 4: Robustness comparison between MVE-DDPG and the proposed HAC under initial state shift. The experiment is conducted on a 10-step LQR task with state dimension 5 and action dimension 3. Plots (a) and (b) illustrate optimal trajectories generated by HAC and MVE-DDPG in dimension 0 (averaged over 10 random seeds) under non-shifted and shifted initial states.
  • Figure 5: Critic Estimation Comparison: Plots comparing normalized estimated Q-values (in MVE-DDPG) and Hamiltonian values (in our HAC) against ground-truth values sampled from 10 random test episodes in LQR. An ideal critic's estimates should concentrate on the red unity line.

Theorems & Definitions (5)

  • Lemma D.1
  • Theorem D.2
  • Lemma D.3
  • Theorem D.4
  • Lemma 2.1

Each result is accompanied by a proof.