Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning

Xuefeng Wang, Lei Zhang, Henglin Pu, Ahmed H. Qureshi, Husheng Li

TL;DR

A CT-MARL framework that uses physics-informed neural networks (PINNs) to approximate HJB-based value functions at scale and aligns value learning with value-gradient learning by introducing a Value Gradient Iteration (VGI) module that improves gradient fidelity.

Abstract

Existing reinforcement learning (RL) methods struggle with complex dynamical systems that demand interactions at high frequencies or irregular time intervals. Continuous-time RL (CTRL) has emerged as a promising alternative by replacing discrete-time Bellman recursion with differential value functions defined as viscosity solutions of the Hamilton-Jacobi-Bellman (HJB) equation. While CTRL has shown promise, its applications have been largely limited to the single-agent domain. This limitation stems from two key challenges: (i) conventional solution methods for HJB equations suffer from the curse of dimensionality (CoD), making them intractable in high-dimensional systems; and (ii) even with HJB-based learning approaches, accurately approximating centralized value functions in multi-agent settings remains difficult, which in turn destabilizes policy training. In this paper, we propose a CT-MARL framework that uses physics-informed neural networks (PINNs) to approximate HJB-based value functions at scale. To ensure the value is consistent with its differential structure, we align value learning with value-gradient learning by introducing a Value Gradient Iteration (VGI) module that iteratively refines value gradients along trajectories. This improves gradient fidelity, in turn yielding more accurate values and stronger policy learning. We evaluate our method using continuous-time variants of standard benchmarks, including multi-agent particle environment (MPE) and multi-agent MuJoCo. Our results demonstrate that our approach consistently outperforms existing continuous-time RL baselines and scales to complex multi-agent dynamics.

Paper Structure

This paper contains 35 sections, 4 theorems, 47 equations, 13 figures, 2 tables, 1 algorithm.

Key Result

Lemma 3.1

For all $x \in \mathcal{X}$, the value function $V(x)$ is the optimal solution satisfying the following HJB PDEs: where the optimal control input is $u^* = \mathop{\mathrm{arg\,max}}\limits_{u \in \mathcal{U}}\mathcal{H}(x,\nabla_x V(x))$. The Hamiltonian $\mathcal{H}$ is defined as $\mathcal{H} = \nabla_x V(x)^{\!\top}\!f\bigl(x,u\bigr)+r\bigl(x,u\bigr)$.
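The Hamiltonian in Lemma 3.1 and its maximizing control can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes a hypothetical scalar-state, two-agent linear-quadratic system with $f(x,u) = a x + b_1 u_1 + b_2 u_2$ and $r(x,u) = -(q x^2 + c_1 u_1^2 + c_2 u_2^2)$, and it treats the value gradient $\nabla_x V(x)$ as a given number `p`. All function names and coefficients are illustrative assumptions. Under these assumptions $\mathcal{H}$ is concave in $u$, so setting $\partial \mathcal{H} / \partial u_i = 0$ gives $u_i^* = p\,b_i / (2 c_i)$, which the code checks against a brute-force grid search.

```python
# Toy illustration of H = grad_V * f(x, u) + r(x, u) and u* = argmax_u H.
# The dynamics, reward, and coefficients below are assumptions for this
# sketch; they are not the paper's benchmark systems.

def hamiltonian(x, u, p, a=-1.0, b=(1.0, 0.5), q=1.0, c=(1.0, 2.0)):
    """Hamiltonian for a scalar state x, joint control u = (u1, u2),
    and a given value gradient p = dV/dx."""
    f = a * x + sum(bi * ui for bi, ui in zip(b, u))               # dynamics
    r = -(q * x * x + sum(ci * ui * ui for ci, ui in zip(c, u)))   # reward
    return p * f + r


def argmax_controls(p, b=(1.0, 0.5), c=(1.0, 2.0)):
    """Closed-form maximizer from dH/du_i = p * b_i - 2 * c_i * u_i = 0."""
    return tuple(p * bi / (2.0 * ci) for bi, ci in zip(b, c))


x, p = 0.7, -1.2
u_star = argmax_controls(p)
h_star = hamiltonian(x, u_star, p)

# Brute-force check: no grid point should beat the closed-form maximizer.
grid = [i / 100.0 for i in range(-200, 201)]
h_grid = max(hamiltonian(x, (u1, u2), p) for u1 in grid for u2 in grid)

print("u* =", u_star)
print("H(u*) =", h_star, " best on grid =", h_grid)
```

In a learned setting, `p` would come from differentiating a value network with respect to the state rather than being fixed by hand, which is why the accuracy of the value gradient directly affects the resulting controls.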

Figures (13)

  • Figure 1: The performance of our CT-MARL and DT-MARL is compared on a continuous-time, two-agent coupled oscillator task. In the discrete-time setting, DT-MARL trained with MADDPG can achieve near-optimal performance. However, when transferred to the continuous-time domain, MADDPG suffers from significant bias and error, resulting in poor approximations. In contrast, CT-MARL yields smoother actions, higher rewards, and more accurate value approximations, closely aligning with the analytical LQR ground truth.
  • Figure 2: Performance across continuous-time multi-agent MuJoCo settings. The y-axis shows the mean cumulative reward.
  • Figure 3: Contours of $V$ and $\nabla_x V$ using VIP with and without VGI in the $d_1$-$d_2$ frame.
  • Figure 4: Performance across continuous-time MPE settings. The y-axis shows the mean cumulative reward.
  • Figure 5: VIP performance with ReLU and Tanh activation functions in MuJoCo and MPE settings.
  • ...and 8 more figures

Theorems & Definitions (11)

  • Definition 1: Value Function of Multi-agent Systems
  • Lemma 3.1: HJB for Multi-agent Systems
  • Lemma 3.2: Instantaneous Advantage
  • Lemma 3.3: Policy Improvement
  • Definition 2: VGI Gradient Estimator
  • Theorem 3.4: Convergence of VGI
  • proof
  • proof
  • proof : Policy Improvement via State-Action Value Function
  • proof
  • ...and 1 more