Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning

Vittorio Giammarino, Ruiqi Ni, Ahmed H. Qureshi

TL;DR

This work tackles offline goal-conditioned reinforcement learning by addressing the fundamental challenge of estimating accurate goal-conditioned value functions from limited data. It introduces a physics-informed Eikonal regularizer, derived from the Eikonal PDE and grounded in continuous-time optimal control, that enforces a distance-like cost-to-go structure on the value function. The regularizer is model-free and TD-compatible, and it integrates with Hierarchical Implicit Q-Learning to form Eik-HIQL, which achieves state-of-the-art results in large-scale navigation and trajectory stitching on OGBench while remaining computationally lightweight. The approach offers clear benefits for navigation tasks but shows limited gains in interactive, contact-rich domains, suggesting future work on task-adaptive speed profiles and on modeling contact dynamics to broaden applicability.

Abstract

Offline Goal-Conditioned Reinforcement Learning (GCRL) holds great promise for domains such as autonomous navigation and locomotion, where collecting interactive data is costly and unsafe. However, it remains challenging in practice because it requires learning from datasets with limited coverage of the state-action space and generalizing across long-horizon tasks. To address these challenges, we propose a Physics-informed (Pi) regularized loss for value learning, derived from the Eikonal Partial Differential Equation (PDE), which induces a geometric inductive bias in the learned value function. Unlike generic gradient penalties that are primarily used to stabilize training, our formulation is grounded in continuous-time optimal control and encourages value functions to align with cost-to-go structures. The proposed regularizer is broadly compatible with temporal-difference-based value learning and can be integrated into existing Offline GCRL algorithms. When combined with Hierarchical Implicit Q-Learning (HIQL), the resulting method, Eikonal-regularized HIQL (Eik-HIQL), yields significant improvements in both performance and generalization, with pronounced gains in stitching regimes and large-scale navigation tasks.
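
To make the idea of an Eikonal residual concrete, the following is a minimal sketch and not the paper's implementation. It assumes the Eikonal PDE in its standard form, $\|\nabla_s V(s,g)\| \, S(s) = 1$ with a speed profile $S$, and penalizes the squared deviation from it. The toy value function, the constant speed, and all function names are hypothetical; a finite-difference gradient stands in for the automatic differentiation a learned value network would presumably use.

```python
import math

def value(s, g, speed=1.0):
    """Toy goal-conditioned value (assumption): negative straight-line travel time."""
    return -math.dist(s, g) / speed

def grad_norm(v, s, g, eps=1e-5):
    """Central finite-difference estimate of ||grad_s v(s, g)||."""
    partials = []
    for i in range(len(s)):
        hi, lo = list(s), list(s)
        hi[i] += eps
        lo[i] -= eps
        partials.append((v(hi, g) - v(lo, g)) / (2.0 * eps))
    return math.hypot(*partials)

def eikonal_residual(v, s, g, speed=1.0):
    """Squared deviation from the assumed Eikonal condition ||grad_s V|| * S = 1."""
    return (grad_norm(v, s, g) * speed - 1.0) ** 2

s, g = [0.3, -1.2], [2.0, 1.0]
res_distance_like = eikonal_residual(value, s, g)                        # close to 0
res_rescaled = eikonal_residual(lambda a, b: 3.0 * value(a, b), s, g)    # close to 4
```

In this toy setting a distance-like value function has a near-zero residual, while a rescaled one with gradient norm 3 is penalized by roughly $(3 - 1)^2 = 4$. In a training loop such a term would be added to the TD loss with a weighting coefficient, which is left out of this sketch.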

Paper Structure

This paper contains 24 sections, 1 theorem, 26 equations, 14 figures, 4 tables, 2 algorithms.

Key Result

Proposition 4.1

Given the Hamiltonian $H(s, g, \nabla_s V(s, g))$, the following inequality holds where $c^*(s) = \inf_{a \in \mathcal{A}}c(s,a)$ and $F^*(s) = \sup_{a \in \mathcal{A}}||f(s,a)||$. In the special case in which $f(s,a) = a$, $||a||=1$, and $c(s,a)$ is constant over $||a||=1$, the Hamiltonian simplifies to
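
As a hedged illustration of the kind of bound involved, assume the standard minimizing form $H(s, g, p) = \inf_{a \in \mathcal{A}} \{ c(s,a) + p \cdot f(s,a) \}$ with $p = \nabla_s V(s,g)$; the paper's definition and sign convention may differ. Under that assumption, Cauchy-Schwarz gives $H \geq c^*(s) - F^*(s)\,||p||$, and in the special case $f(s,a) = a$, $||a|| = 1$, constant $c$, the infimum is attained at $a = -p/||p||$, so $H = c - ||p||$. The sketch below checks the special case numerically in two dimensions; the function name, the action grid, and the sample gradient are hypothetical.

```python
import math

def hamiltonian_unit_speed(p, cost=1.0, n_actions=3600):
    """Brute-force inf over unit actions a of cost + p . a in 2-D (assumed form of H)."""
    best = math.inf
    for k in range(n_actions):
        theta = 2.0 * math.pi * k / n_actions
        a = (math.cos(theta), math.sin(theta))
        best = min(best, cost + p[0] * a[0] + p[1] * a[1])
    return best

p = (3.0, -4.0)                       # stand-in for grad_s V(s, g), with norm 5
numeric = hamiltonian_unit_speed(p)   # grid approximation of the infimum
closed_form = 1.0 - math.hypot(*p)    # c - ||p|| under the stated assumptions
```

The grid value can only sit at or above the closed form, since it minimizes over a finite subset of the unit circle, and the gap shrinks as `n_actions` grows.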

Figures (14)

  • Figure 1: Contour plots of the GCVF for antmaze-giant-navigate-v0 (park2024ogbench), learned after 100,000 training steps by our physics-informed algorithm, Eik-HIQL, and by standard HIQL. The plots are generated by varying the $x$-$y$ coordinates of the agent's center of mass while keeping all other states fixed. Recall that the policy $\pi$ is trained to move the agent in the direction that maximizes the GCVF. The effects of the Eikonal regularizer are evident in Fig. 1a, where the contour plot closely follows the maze structure, in contrast to Fig. 1b, where the learned GCVF ignores the maze structure.
  • Figure 2: Environments from OGBench (park2024ogbench) used in our experiments. These include a variety of goal-conditioned tasks spanning navigation and locomotion (e.g., pointmaze, antmaze, humanoidmaze), contact-rich locomotion (antsoccer), and contact-rich manipulation (cube, scene). The environments differ significantly in dynamics complexity, dimensionality, and task structure, providing a comprehensive testbed for evaluating Offline GCRL algorithms.
  • Figure 2: Complete comparison between Eik-HIQL and the Offline GCRL baselines. Agents are trained for 100,000 steps on pointmaze tasks and 1 million steps on the remaining tasks, each using 10 seeds. The evaluation follows the methodology described in Table \ref{tab:speed_ablation}. We report the mean and standard deviation across seeds for the best evaluation achieved during training. Results within $95\%$ of the best value are written in bold, and rows are highlighted when the Eikonal regularizer improves performance by $100\%$ or more compared to the non-regularized HIQL performance.
  • Figure 3: Contour plots of the GCVF on antsoccer-medium-navigate-v0 (park2024ogbench), learned after 1 million steps by Eik-HIQL and HIQL, respectively. These plots are generated following the same methodology as in Fig. 1.
  • Figure 4: Fig. \ref{fig:distance_obstacles} illustrates the computation of the distance function $d(s)$ used in Eqs. \ref{eq:exp_speed} and \ref{eq:linear_speed}. Let the state be represented by its spatial coordinates $s = (x,y) \in \mathbb{R}^2$, and let $\mathcal{O} = \{o_1, \ldots, o_M\}$ denote the set of obstacle coordinates in the maze. We define $d(s) = \min_{o \in \mathcal{O}} \| s - o \|_2$, i.e., the Euclidean distance from $s$ to the nearest obstacle. Fig. \ref{fig:linear_speed_profile} reports the resulting speed profile obtained using $S_{\text{lin}}(s)$ in Eq. \ref{eq:linear_speed} for the pointmaze-medium-navigate-v0 dataset.
  • ...and 9 more figures
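
The nearest-obstacle distance $d(s) = \min_{o \in \mathcal{O}} \| s - o \|_2$ from the Figure 4 caption can be sketched in a few lines. This is an illustrative example only: the function name and the obstacle coordinates are hypothetical, and the speed profiles that consume $d(s)$ are not reproduced because their definitions are not given here.

```python
import math

def nearest_obstacle_distance(s, obstacles):
    """d(s): Euclidean distance from state s = (x, y) to the closest obstacle point."""
    return min(math.dist(s, o) for o in obstacles)

# Hypothetical obstacle coordinates; a real maze would supply its own set O.
obstacles = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
d = nearest_obstacle_distance((4.0, 3.0), obstacles)  # candidate distances are 5, 3, 4
```

A linear scan is sufficient for a small obstacle set; for a dense maze grid, a spatial index such as a k-d tree would avoid recomputing every pairwise distance.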

Theorems & Definitions (5)

  • Proposition 4.1
  • proof
  • Remark 4.2: Connection between HJB and Eikonal residuals
  • Remark 4.3: Why the Eikonal residual helps in Offline GCRL
  • proof