Dual-Objective Reinforcement Learning with Novel Hamilton-Jacobi-Bellman Formulations

William Sharpless, Dylan Hirsch, Sander Tonkens, Nikhil Shinde, Sylvia Herbert

TL;DR

The paper develops dual-objective reinforcement learning formulations grounded in Hamilton-Jacobi theory, introducing Reach-Always-Avoid (RAA) and Reach-Reach (RR) value functions and proving their Bellman decompositions into simpler subproblems. It shows that augmented MDPs—with trajectory-history tracking—are sufficient for optimality and presents DOHJ-PPO, a PPO-based algorithm that solves the decomposed value functions via coupled, on-policy learning augmented with SRBE/SRABE. The approach outperforms Lagrangian-based and existing HJ-RL baselines on safety-focused arrival and multi-target tasks, including stochastic dynamics. This work provides a practical, theoretically grounded route to balanced dual-objective control in complex RL settings, with potential extensions to richer temporal logic specifications.

Abstract

Hard constraints in reinforcement learning (RL) often degrade policy performance. Lagrangian methods offer a way to blend objectives with constraints, but require intricate reward engineering and parameter tuning. In this work, we extend recent advances that connect Hamilton-Jacobi (HJ) equations with RL to propose two novel value functions for dual-objective satisfaction. Namely, we address: 1) the Reach-Always-Avoid (RAA) problem -- of achieving distinct reward and penalty thresholds -- and 2) the Reach-Reach (RR) problem -- of achieving thresholds of two distinct rewards. In contrast with temporal logic approaches, which typically involve representing an automaton, we derive explicit, tractable Bellman forms in this context via decomposition. Specifically, we prove that the RAA and RR problems may be rewritten as compositions of previously studied HJ-RL problems. We leverage our analysis to propose a variation of Proximal Policy Optimization (DOHJ-PPO), and demonstrate that it produces distinct behaviors from previous approaches, outcompeting a number of baselines in success, safety and speed across a range of tasks for safe-arrival and multi-target achievement.

Paper Structure

This paper contains 19 sections, 5 theorems, 20 equations, 5 figures.

Key Result

Theorem 1

For all initial states $s \in \mathcal{S}$, where $r_{\textup{RAA}}(s) := \min\left\{ r(s), V_{\textup{A}}^*(s) \right\}$, with

Figures (5)

  • Figure 1: Depiction of the Reach-Always-Avoid (RAA) and Reach-Reach (RR) Tasks. In the RAA tasks, the zero-level sets of the rewards (goals) and penalties (obstacles) are depicted in green and red respectively, while in the RR problem, the zero-level sets of the two rewards (two goals) are depicted in green and blue. The RAA value is defined as the minimum of the minimum penalty and the maximum reward, inducing the agents to enter the goals at some time without ever entering the obstacles. The RR value is defined as the minimum of the two maximum rewards, inducing the agents to enter both goals at some time.
  • Figure 2: DQN Grid-World Demonstration of the RAA & RR Problems. We compare our novel formulations with previous HJ-RL formulations (RA & R) in a simple grid-world problem with DQN. The zero-level sets of $q$ (hazards) are highlighted in red, those of $r$ (goals) in blue, and trajectories in black (starting at the dot). In both models, the agents' actions are limited to {left, right, straight} and the system flows upwards over time.
  • Figure 3: Examples where a Non-Augmented Policy is Flawed. In both MDPs, consider an agent with no memory. (Left) For a deterministic policy based on the current state, the agent can only achieve one target (RR), as this policy must associate the middle state with either of the two possible actions. (Right) The RAA case is slightly more complex. Assume the robot will make sure to avoid the fire at all costs (which is easily done from the current state). It would also prefer to not encounter the cone hazard, but will do so if needed to achieve the target. From its current state the robot cannot determine whether to pursue the target by crossing the cone or move to the right. The correct decision depends on state history, specifically on whether the robot has already reached the target state or not (e.g. imagine the initial state is on the target state).
  • Figure 4: Success ($\rightarrow$) and Partial Success ($\rightarrow$) in RAA and RR Tasks for DOHJ-PPO and Baselines. We evaluate DOHJ-PPO (in black) against baselines over 1,000 trajectories in the Hopper, F16, SafetyGym and HalfCheetah environments. The first and third rows give the Partial Success percentage of each algorithm, defined as the share of trajectories that achieve one objective (reaching or always-avoiding in the RAA task, reaching either target in the RR task). The second and fourth rows give the Success percentage, defined as the share of trajectories that achieve both objectives. Most baselines achieve partial success, but few achieve total success as the environment becomes more difficult, underscoring the difficulty of balancing objectives in RL.
  • Figure 5: Success ($\uparrow$) in the HalfCheetah RAA and RR Tasks with Increasingly Stochastic Dynamics. We plot the learning trajectories of DOHJ-PPO (in black) and the top baselines for the HalfCheetah environment with affine Gaussian noise added to the dynamics. Task achievement (success) is given by the percentage of 256 trajectories that either reach the target and always avoid the obstacles or reach both targets (corresponding to $V_{\mathrm{RAA}} > 0$ and $V_{\mathrm{RR}} > 0$). Each column corresponds to a different scale of noise (null, low (0.5), moderate (1.0) and high (2.0)), which is added to the velocities and angular velocities of the HalfCheetah dynamics. In the RAA task, DOHJ-PPO outperforms all baselines except at the highest noise setting, where all algorithms perform equally poorly. In the RR task, DOHJ-PPO outperforms all algorithms significantly. In summary, this ablation demonstrates the robustness of DOHJ-PPO to a degree of stochasticity in the dynamics and the validity of the SRBE and SRABE approximations.
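
The trajectory-level quantities described in the Figure 1 caption can be made concrete with a small sketch. This is not the authors' implementation: the function names and the sample numbers are hypothetical. It also assumes a sign convention in which a positive reward means the state is inside a goal and a positive penalty value means the state is outside every obstacle, which is consistent with success being reported as $V_{\mathrm{RAA}} > 0$ and $V_{\mathrm{RR}} > 0$. The last helper illustrates, again only as an assumption, the kind of running history (best reward so far, worst penalty so far) that a memoryless policy such as the ones in Figure 3 does not have access to.

```python
from typing import List, Sequence, Tuple


def raa_outcome(rewards: Sequence[float], penalties: Sequence[float]) -> float:
    """Minimum of the maximum reward and the minimum penalty along one trajectory.

    Assumed sign convention: the result is positive only if the goal is entered
    at some time and the obstacles are never entered.
    """
    return min(max(rewards), min(penalties))


def rr_outcome(rewards_1: Sequence[float], rewards_2: Sequence[float]) -> float:
    """Minimum of the two maximum rewards along one trajectory.

    Positive only if both goals are entered at some time.
    """
    return min(max(rewards_1), max(rewards_2))


def track_history(
    rewards: Sequence[float], penalties: Sequence[float]
) -> List[Tuple[float, float]]:
    """Running (best reward so far, worst penalty so far) at each step.

    Hypothetical illustration of trajectory-history tracking; the exact
    augmentation used in the paper is not given in this section.
    """
    best_r, worst_q = float("-inf"), float("inf")
    history = []
    for r, q in zip(rewards, penalties):
        best_r = max(best_r, r)
        worst_q = min(worst_q, q)
        history.append((best_r, worst_q))
    return history


# Hypothetical per-step values for a four-step trajectory.
rewards = [-2.0, -1.0, 0.5, -0.3]         # goal entered at step 2
safe_penalties = [1.0, 0.4, 0.2, 0.6]     # obstacles never entered
unsafe_penalties = [1.0, -0.1, 0.2, 0.6]  # obstacle entered at step 1
second_rewards = [-1.0, -0.5, -0.2, -0.1] # second goal never entered

raa_safe = raa_outcome(rewards, safe_penalties)      # success: value > 0
raa_unsafe = raa_outcome(rewards, unsafe_penalties)  # failure: value < 0
rr_value = rr_outcome(rewards, second_rewards)       # only one goal reached
final_history = track_history(rewards, safe_penalties)[-1]
```

In this sketch the unsafe trajectory still reaches the goal, so it would count as a partial success in the sense of Figure 4 but not as a success. The final entry of `track_history` holds the two quantities whose minimum is the RAA outcome.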

Theorems & Definitions (5)

  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Theorem 3