LPPG-RL: Lexicographically Projected Policy Gradient Reinforcement Learning with Subproblem Exploration
Ruiyu Qiu, Rui Wang, Guanghui Yang, Xiang Li, Zhijiang Shao
TL;DR
The paper tackles lexicographic multi-objective RL in continuous action spaces by reframing lexicographic updates as a convex projection problem over the intersection of higher-priority constraint half-spaces. It introduces LPPG-RL, which computes a feasible update direction $d^*$ by solving $\min_d \|d - g_M\|^2$ subject to $d \in \mathcal{C}_M$ with a lightweight Dykstra projection loop, and augments this with Subproblem Exploration to ensure balanced subtask learning. The approach is instantiated as LPPG-PPO and LPPG-SAC, and in Nav2D experiments it achieves strict priority satisfaction, improved stability, and higher efficiency than state-of-the-art continuous LMORL baselines. Theoretical analysis provides convergence guarantees for the two-timescale actor-critic setting and a concrete lower bound on policy improvement per update, while experiments demonstrate up to 20x faster gradient projection for small problems and robust performance under perturbations. Overall, LPPG-RL removes reliance on manual threshold tuning, extends lexicographic optimization to continuous domains, and offers practical advantages for safety- and priority-sensitive RL tasks.
Abstract
Lexicographic multi-objective problems, which consist of multiple conflicting subtasks with explicit priorities, are common in real-world applications. Despite the advantages of Reinforcement Learning (RL) in single-task settings, extending conventional RL methods to multiple prioritized objectives remains challenging. In particular, traditional Safe RL and Multi-Objective RL (MORL) methods have difficulty enforcing priority orderings efficiently. Lexicographic Multi-Objective RL (LMORL) methods have therefore been developed to address these challenges. However, existing LMORL methods either rely on heuristic threshold tuning with prior knowledge or are restricted to discrete domains. To overcome these limitations, we propose Lexicographically Projected Policy Gradient RL (LPPG-RL), a novel LMORL framework that leverages sequential gradient projections to identify feasible policy update directions, making it broadly compatible with all policy gradient algorithms in continuous spaces. LPPG-RL reformulates the projection step as an optimization problem and uses Dykstra's projection rather than generic solvers, which delivers substantial speedups, especially for small- to medium-scale instances. In addition, LPPG-RL introduces Subproblem Exploration (SE) to prevent gradient vanishing, accelerate convergence, and enhance stability. We provide theoretical guarantees for convergence and establish a lower bound on policy improvement. Finally, through extensive experiments in a 2D navigation environment, we demonstrate the effectiveness of LPPG-RL, showing that it outperforms existing state-of-the-art continuous LMORL methods.
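To make the projection step concrete, the following is a minimal sketch of Dykstra's algorithm for projecting a gradient onto an intersection of half-spaces. It is an illustration under stated assumptions, not the authors' implementation: the function names `project_halfspace` and `dykstra_halfspaces`, the stopping rule, and the half-space form $\{d : a_i^\top d \ge b_i\}$ (with $a_i$ standing in for a higher-priority gradient and $b_i = 0$ by default) are all hypothetical choices, since the summary does not specify how the constraint set is parameterized.

```python
import numpy as np


def project_halfspace(x, a, b=0.0):
    """Euclidean projection of x onto the half-space {d : a @ d >= b}."""
    violation = b - a @ x
    if violation <= 0.0:
        return x  # already feasible, nothing to do
    # Move along the normal a just far enough to reach the boundary.
    return x + (violation / (a @ a)) * a


def dykstra_halfspaces(g, normals, offsets=None, iters=200, tol=1e-10):
    """Approximate argmin_d ||d - g||^2 s.t. normals[i] @ d >= offsets[i].

    Assumption: each constraint is a half-space; in an LPPG-style setting
    the rows of `normals` would play the role of higher-priority gradients.
    """
    g = np.asarray(g, dtype=float)
    normals = np.asarray(normals, dtype=float)
    if offsets is None:
        offsets = np.zeros(len(normals))
    x = g.copy()
    # One correction vector per constraint; this is what distinguishes
    # Dykstra's method from plain alternating projections.
    corrections = np.zeros_like(normals)
    for _ in range(iters):
        x_prev = x.copy()
        for i, (a, b) in enumerate(zip(normals, offsets)):
            y = x + corrections[i]
            x = project_halfspace(y, a, b)
            corrections[i] = y - x
        # Simple heuristic stopping rule: stop when a full sweep
        # no longer moves the iterate.
        if np.linalg.norm(x - x_prev) < tol:
            break
    return x
```

For instance, `dykstra_halfspaces([-1.0, -2.0], [[1.0, 0.0], [0.0, 1.0]])` projects the point onto the nonnegative quadrant and returns the origin. The per-constraint correction terms are what make the result the true nearest point in the intersection rather than just some feasible point.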
