Dual-Objective Reinforcement Learning with Novel Hamilton-Jacobi-Bellman Formulations
William Sharpless, Dylan Hirsch, Sander Tonkens, Nikhil Shinde, Sylvia Herbert
TL;DR
The paper develops dual-objective reinforcement learning formulations grounded in Hamilton-Jacobi theory. It introduces Reach-Always-Avoid (RAA) and Reach-Reach (RR) value functions and proves that they decompose into simpler subproblems with explicit Bellman forms. It shows that MDPs augmented with trajectory-history tracking are sufficient for optimality, and presents DOHJ-PPO, a PPO-based algorithm that solves the decomposed value functions through coupled, on-policy learning augmented with SRBE/SRABE. The approach outperforms Lagrangian-based and existing HJ-RL baselines on safety-focused arrival and multi-target tasks, including tasks with stochastic dynamics. This work provides a practical, theoretically grounded route to balanced dual-objective control in complex RL settings, with potential extensions to richer temporal logic specifications.
Abstract
Hard constraints in reinforcement learning (RL) often degrade policy performance. Lagrangian methods offer a way to blend objectives with constraints, but require intricate reward engineering and parameter tuning. In this work, we extend recent advances that connect Hamilton-Jacobi (HJ) equations with RL to propose two novel value functions for dual-objective satisfaction. Namely, we address: 1) the Reach-Always-Avoid (RAA) problem of achieving distinct reward and penalty thresholds, and 2) the Reach-Reach (RR) problem of achieving thresholds of two distinct rewards. In contrast with temporal logic approaches, which typically involve representing an automaton, we derive explicit, tractable Bellman forms in this context via decomposition. Specifically, we prove that the RAA and RR problems may be rewritten as compositions of previously studied HJ-RL problems. We leverage our analysis to propose a variation of Proximal Policy Optimization (DOHJ-PPO), and demonstrate that it produces distinct behaviors from previous approaches, outcompeting a number of baselines in success, safety and speed across a range of tasks for safe-arrival and multi-target achievement.
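To make the two problem statements concrete, the following minimal sketch scores a single finished trajectory under one plausible reading of each objective. The abstract does not give these formulas, so they are assumptions: RAA is taken as the smaller of the best reward ever reached and the worst safety margin ever seen, and RR as the smaller of the best values reached for each of two rewards. The function names `raa_outcome` and `rr_outcome`, and the convention that a larger margin means safer, are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def raa_outcome(rewards: Sequence[float], margins: Sequence[float]) -> float:
    """Trajectory score under an assumed Reach-Always-Avoid reading.

    rewards: per-step reward signal (the "reach" objective).
    margins: per-step safety margin, assumed larger = safer (the "avoid" objective).
    The score is high only if some step reaches a high reward AND
    every step keeps a high safety margin.
    """
    best_reward = max(rewards)    # reach: best value attained at any time
    worst_margin = min(margins)   # always-avoid: worst value over all time
    return min(best_reward, worst_margin)


def rr_outcome(rewards_a: Sequence[float], rewards_b: Sequence[float]) -> float:
    """Trajectory score under an assumed Reach-Reach reading.

    The score is high only if both reward signals reach a high value
    at some (possibly different) time step.
    """
    return min(max(rewards_a), max(rewards_b))


# Illustrative numbers only; they are not taken from the paper.
example_raa = raa_outcome([-1.0, 0.5, 2.0], [1.0, 0.3, 0.8])
example_rr = rr_outcome([0.0, 1.5, 0.2], [0.7, -0.1, 0.4])
```

Under this reading, a trajectory satisfies a threshold pair exactly when its score clears the smaller requirement, which is why both objectives must be met at once rather than traded off through a weighted sum as in a Lagrangian formulation.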
