LPPG-RL: Lexicographically Projected Policy Gradient Reinforcement Learning with Subproblem Exploration

Ruiyu Qiu, Rui Wang, Guanghui Yang, Xiang Li, Zhijiang Shao

TL;DR

The paper tackles lexicographic multi-objective RL in continuous action spaces by reframing lexicographic updates as a convex projection problem over the intersection of higher-priority constraint half-spaces. It introduces LPPG-RL, which computes a feasible update direction $d^*$ by solving $\min_d \|d - g_M\|^2$ subject to $d \in \mathcal{C}_M$, using a light Dykstra projection loop, and augments this with Subproblem Exploration to ensure balanced subtask learning. The approach is instantiated with LPPG-PPO and LPPG-SAC, and is shown to achieve strict priority satisfaction, improved stability, and higher efficiency than state-of-the-art continuous LMORL baselines in Nav2D experiments. Theoretical analysis provides convergence guarantees for the two-timescale actor-critic setting and a concrete lower bound on policy improvement per update, while experiments demonstrate up to 20x faster gradient projection for small problems and robust performance under perturbations. Overall, LPPG-RL removes reliance on manual threshold tuning, extends lexicographic optimization to continuous domains, and offers practical advantages for safety- and priority-sensitive RL tasks.
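The projection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each higher-priority constraint is a half-space of the form $\{d : \langle g_i, d\rangle \ge b_i\}$ with $b_i = 0$ by default (the exact constraint form used in the paper is not given here), and the names `project_halfspace` and `dykstra_project` are hypothetical.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def project_halfspace(y: Sequence[float], normal: Sequence[float],
                      offset: float = 0.0) -> Vector:
    """Euclidean projection of y onto {d : <normal, d> >= offset}."""
    violation = offset - dot(normal, y)
    if violation <= 0.0:  # y already satisfies the constraint
        return list(y)
    scale = violation / dot(normal, normal)
    return [yi + scale * ni for yi, ni in zip(y, normal)]


def dykstra_project(target: Sequence[float],
                    normals: Sequence[Sequence[float]],
                    offsets: Optional[Sequence[float]] = None,
                    max_sweeps: int = 1000,
                    tol: float = 1e-10) -> Vector:
    """Approximate argmin_d ||d - target||^2 over the intersection of half-spaces.

    Assumption: normals are the higher-priority gradients and offsets default to 0.
    """
    if offsets is None:
        offsets = [0.0] * len(normals)
    x = list(target)
    # One correction vector per constraint set, as in Dykstra's algorithm.
    corrections = [[0.0] * len(x) for _ in normals]
    for _ in range(max_sweeps):
        change = 0.0
        for i, (normal, offset) in enumerate(zip(normals, offsets)):
            y = [xi + ci for xi, ci in zip(x, corrections[i])]
            x_new = project_halfspace(y, normal, offset)
            new_corr = [yi - xi for yi, xi in zip(y, x_new)]
            # Track movement of both the iterate and the correction term.
            diffs = [abs(a - b) for a, b in zip(x_new, x)]
            diffs += [abs(a - b) for a, b in zip(new_corr, corrections[i])]
            change = max([change] + diffs)
            x, corrections[i] = x_new, new_corr
        if change < tol:
            break
    return x


# Two objectives: g1 has priority, g2 conflicts with it along the first axis.
g1 = [1.0, 0.0]
g2 = [-1.0, 1.0]
d_star = dykstra_project(g2, [g1])
print(d_star)  # [0.0, 1.0]
```

In the example, the conflicting component of `g2` along `g1` is removed, so the update no longer decreases the higher-priority objective to first order. The stopping rule monitors the correction vectors as well as the iterate, since checking the iterate alone can stop Dykstra's method too early.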

Abstract

Lexicographic multi-objective problems, which consist of multiple conflicting subtasks with explicit priorities, are common in real-world applications. Despite the advantages of Reinforcement Learning (RL) in single tasks, extending conventional RL methods to prioritized multiple objectives remains challenging. In particular, traditional Safe RL and Multi-Objective RL (MORL) methods have difficulty enforcing priority orderings efficiently. Therefore, Lexicographic Multi-Objective RL (LMORL) methods have been developed to address these challenges. However, existing LMORL methods either rely on heuristic threshold tuning with prior knowledge or are restricted to discrete domains. To overcome these limitations, we propose Lexicographically Projected Policy Gradient RL (LPPG-RL), a novel LMORL framework that leverages sequential gradient projections to identify feasible policy update directions, thereby making LPPG-RL broadly compatible with all policy gradient algorithms in continuous spaces. LPPG-RL reformulates the projection step as an optimization problem and uses Dykstra's projection rather than generic solvers to deliver substantial speedups, especially for small- to medium-scale instances. In addition, LPPG-RL introduces Subproblem Exploration (SE) to prevent gradient vanishing, accelerate convergence, and enhance stability. We provide theoretical guarantees for convergence and establish a lower bound on policy improvement. Finally, through extensive experiments in a 2D navigation environment, we demonstrate the effectiveness of LPPG-RL, showing that it outperforms existing state-of-the-art continuous LMORL methods.
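One way to picture Subproblem Exploration is as a sampling step that picks which prefix of the priority list to optimize in a given update. The sketch below is only an illustration under stated assumptions: it assumes the subproblem index is drawn uniformly and that subproblem $k$ uses $g_k$ as the target with $g_1, \dots, g_{k-1}$ as constraint normals. The abstract does not spell out the sampling rule, and `draw_subproblem` is a hypothetical name.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def draw_subproblem(gradients: Sequence[Vector],
                    rng: random.Random) -> Tuple[List[Vector], Vector]:
    """Pick one subproblem from gradients ordered from high to low priority.

    Assumption: the index is uniform over all subtasks. Returns the
    higher-priority gradients (constraints) and the gradient to follow (target).
    """
    if not gradients:
        raise ValueError("at least one gradient is required")
    k = rng.randrange(len(gradients))  # 0-based subtask index
    return list(gradients[:k]), gradients[k]


rng = random.Random(0)
grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
counts = [0, 0, 0]
for _ in range(3000):
    constraints, target = draw_subproblem(grads, rng)
    counts[len(constraints)] += 1
print(counts)  # roughly 1000 draws per subtask
```

The returned pair can then be handed to any routine that projects `target` onto the half-spaces defined by `constraints`. When the highest-priority subtask is drawn, the constraint list is empty and the update reduces to an ordinary policy gradient step on that subtask.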

Paper Structure

This paper contains 36 sections, 5 theorems, 24 equations, 12 figures, 7 tables, 4 algorithms.

Key Result

Theorem 1

Consider an LMORL problem with subtask set $\mathcal{K} = \{K_1,\cdots, K_M\}$ and LPPG-RL algorithm in a general actor-critic framework. Let $\theta_t$ be the actor parameters and $\phi_t$ be the multi–head critic parameters. Assume, Then, we have our parameters converge to a local or global lexicographic optimum $(\theta^*, \phi^*(\theta^*))$ with a stepsize $\alpha_\theta \leq \min\left\{2g_i^

Figures (12)

  • Figure 1: The overall workflow of LPPG-RL. The agent consists of a policy network and a multi-head critic network, where the policy network learns the global optimal lexicographic action. The lower part is LPPG. Policy gradients are calculated each time with rollouts, and a subproblem is drawn to get a feasible update direction so that each subtask can be trained uniformly.
  • Figure 2: Illustration of the optimal solution of Equation \ref{eq:min d g} with three gradients. The blue, red, and green vectors represent the policy gradients $g_1,g_2,g_3$, ordered from high to low priority, serving as the normal vectors of the respective half-spaces. The intersection of these half-spaces forms a cone. The optimal solution, shown as the black arrow $d^*$, is the vector within the intersection $\mathcal{C}$ that is closest to $g_3$.
  • Figure 3: 1-goal experiment in the 2D navigation environment. \ref{fig:single goal traj} shows the map and agent trajectories under 50 Monte-Carlo simulations with different seeds. \ref{fig:single goal traj 95 estimated} displays the 95% statistical confidence corridor of trajectories under $\mathcal{N}(0, 0.1)$ Gaussian noise applied to the state. \ref{fig:single goal a distribution} illustrates the deterministic policy direction for different agent locations in the map, where the length of each arrow corresponds to the action magnitude.
  • Figure 4: 2-goal experiment in the 2D navigation environment. The start region is initialized as before, and two symmetric goal regions are designated as subtasks with different priorities. Figure \ref{fig:double_goal} and Figure \ref{fig:double_goal_reversed} show the trajectories under 50 Monte-Carlo simulations with two different subtask priority configurations. The green goal region has higher priority in Figure \ref{fig:double_goal} and lower priority in Figure \ref{fig:double_goal_reversed}. Figure \ref{fig:double_goal_returns_compare} compares the returns for the two goal regions under the two priority configurations, demonstrating that our method strictly preserves lexicographic priorities. The legend entries "G$\rightarrow$R" and "R$\rightarrow$G" correspond to Figure \ref{fig:double_goal} and Figure \ref{fig:double_goal_reversed}, respectively.
  • Figure 5: Training snapshots of the Nav2D-1G environment from 10k steps to 1M steps. The blue region is the start region; the green goal region is the only target.
  • ...and 7 more figures
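
For intuition about the geometry in Figure 2, the two-gradient special case has a closed form. Assuming the single higher-priority constraint is the half-space $\{d : \langle g_1, d\rangle \ge 0\}$ with $g_1 \neq 0$ (the zero offset is an assumption made here for illustration), the standard projection of $g_2$ onto that half-space is

$$
d^* = \arg\min_{d}\ \|d - g_2\|^2 \ \ \text{s.t.}\ \ \langle g_1, d\rangle \ge 0
\quad\Longrightarrow\quad
d^* = g_2 - \frac{\min\{0,\ \langle g_1, g_2\rangle\}}{\|g_1\|^2}\, g_1 .
$$

If $\langle g_1, g_2\rangle \ge 0$ the gradients do not conflict and $d^* = g_2$; otherwise the component of $g_2$ along $g_1$ is removed, giving $\langle g_1, d^*\rangle = 0$. With three or more gradients the intersection is a cone, as in Figure 2, and no single-step formula of this kind applies in general, which is where an iterative scheme such as Dykstra's projection comes in.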

Theorems & Definitions (7)

  • Theorem 1
  • Theorem 2
  • Remark 1
  • Lemma 1: Actor-Critic convergence
  • Lemma 2: Lexicographic feasibility
  • Remark 2
  • Lemma 3: Policy update bound