Prioritized Soft Q-Decomposition for Lexicographic Reinforcement Learning

Finn Rietz, Erik Schaffernicht, Stefan Heinrich, Johannes Andreas Stork

TL;DR

This work addresses lexicographic multi-objective reinforcement learning in continuous action spaces by introducing a subtask transformation and a decomposed learning framework. The method, prioritized soft Q-decomposition (PSQD), scalarizes the lexicographic problem via $Q_{\succ}(\mathbf{s},\mathbf{a}) = \sum_{i=1}^{n-1} \ln(c_i(\mathbf{s},\mathbf{a})) + Q_n(\mathbf{s},\mathbf{a})$, with transformed rewards $r_{\succ i}=\ln(c_i)$ and $r_{\succ n}=r_n$, enabling incremental learning and reuse of higher-priority subtasks. The paper provides subtask-view and arbiter-view theoretical analyses, along with a practical algorithm for continuous spaces that can perform offline adaptation using retained data. Empirical results on a 2D navigation task and a high-dimensional Franka Panda control task show that PSQD preserves lexicographic priorities in zero-shot and online adaptation, outperforming baselines that fail to respect priorities and demonstrating data-efficient reuse of subtask solutions. The method offers interpretability through decomposed subtask components and suggests a principled approach for tackling complex RL problems via modular task composition and reuse.
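The scalarization above can be illustrated for a single state with a small, discrete set of actions. This is only a minimal sketch, not the authors' implementation: it assumes each $c_i$ is a 0/1 indicator that keeps actions whose higher-priority value lies within a threshold `eps` of the best one (the exact form of $c_i$ is not given in this summary), and the helper names `indicator`, `log_constraint`, and `prioritized_q` are hypothetical.

```python
import math


def log_constraint(c: float) -> float:
    """Return ln(c), using -inf for c == 0 so forbidden actions are ruled out."""
    return math.log(c) if c > 0 else -math.inf


def indicator(q_values, eps):
    """Assumed constraint: 1.0 for actions within eps of the best value, else 0.0."""
    best = max(q_values)
    return [1.0 if q >= best - eps else 0.0 for q in q_values]


def prioritized_q(higher_priority_qs, eps_list, q_n):
    """Q_succ(s, a) = sum_i ln(c_i(s, a)) + Q_n(s, a) for one state."""
    total = list(q_n)
    for q_i, eps in zip(higher_priority_qs, eps_list):
        c_i = indicator(q_i, eps)
        total = [t + log_constraint(c) for t, c in zip(total, c_i)]
    return total


# Three candidate actions in one state (made-up numbers).
q1 = [0.0, -0.1, -5.0]  # high-priority subtask, e.g. obstacle avoidance
q2 = [1.0, 2.0, 9.0]    # low-priority subtask, e.g. reaching the goal

q_succ = prioritized_q([q1], [0.5], q2)
best_action = max(range(len(q_succ)), key=q_succ.__getitem__)
print(q_succ, best_action)
```

In this toy case the third action has the highest low-priority value, but the high-priority constraint assigns it $\ln(0) = -\infty$, so the greedy choice falls on the second action. This is the sense in which the sum does not trade priorities off against each other.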

Abstract

Reinforcement learning (RL) for complex tasks remains a challenge, primarily due to the difficulties of engineering scalar reward functions and the inherent inefficiency of training models from scratch. Instead, it would be better to specify complex tasks in terms of elementary subtasks and to reuse subtask solutions whenever possible. In this work, we address continuous space lexicographic multi-objective RL problems, consisting of prioritized subtasks, which are notoriously difficult to solve. We show that these can be scalarized with a subtask transformation and then solved incrementally using value decomposition. Exploiting this insight, we propose prioritized soft Q-decomposition (PSQD), a novel algorithm for learning and adapting subtask solutions under lexicographic priorities in continuous state-action spaces. PSQD offers the ability to reuse previously learned subtask solutions in a zero-shot composition, followed by an adaptation step. Its ability to use retained subtask training data for offline learning eliminates the need for new environment interaction during adaptation. We demonstrate the efficacy of our approach by presenting successful learning, reuse, and adaptation results for both low- and high-dimensional simulated robot control tasks, as well as offline learning results. In contrast to baseline approaches, PSQD does not trade off between conflicting subtasks or priority constraints and satisfies subtask priorities during learning. PSQD provides an intuitive framework for tackling complex RL problems, offering insights into the inner workings of the subtask composition.

Paper Structure

This paper contains 43 sections, 8 theorems, 51 equations, 9 figures, 2 algorithms.

Key Result

Theorem 3.1

Consider the soft Bellman backup operator $\mathcal{T}$, and an initial mapping $Q^0: \mathcal{S} \times \mathcal{A}_\succ \to \mathbb{R}$ with $|\mathcal{A}_\succ| < \infty$ and define $Q^{l+1} = \mathcal{T}Q^{l}$, then the sequence of $Q^l$ converges to $Q^*_\succ$, the soft Q-value of the optimal
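As a rough illustration of the iteration $Q^{l+1} = \mathcal{T}Q^{l}$ on a finite, restricted action set, the sketch below repeatedly applies a soft Bellman backup to a toy tabular problem. Everything specific here is an assumption made for illustration: the 2-state deterministic MDP, the discount `gamma = 0.9`, a temperature of 1, and the boolean `allowed` mask standing in for $\mathcal{A}_\succ$. It is not the paper's algorithm for continuous spaces.

```python
import math


def soft_value(q_row, allowed):
    """Soft (log-sum-exp) state value, taken over the allowed actions only."""
    vals = [q for q, ok in zip(q_row, allowed) if ok]
    m = max(vals)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in vals))


def soft_backup(q, reward, next_state, allowed, gamma):
    """Apply the soft Bellman backup once on a deterministic tabular MDP."""
    return [
        [
            reward[s][a]
            + gamma * soft_value(q[next_state[s][a]], allowed[next_state[s][a]])
            for a in range(len(q[s]))
        ]
        for s in range(len(q))
    ]


# Hypothetical 2-state, 2-action MDP; action 1 is forbidden in state 0.
reward = [[0.0, 5.0], [1.0, 0.0]]
next_state = [[1, 0], [0, 1]]
allowed = [[True, False], [True, True]]
gamma = 0.9

q = [[0.0, 0.0], [0.0, 0.0]]
for step in range(500):
    new_q = soft_backup(q, reward, next_state, allowed, gamma)
    # Largest change of any entry between two consecutive iterates.
    gap = max(abs(n - o) for rn, ro in zip(new_q, q) for n, o in zip(rn, ro))
    q = new_q
    if gap < 1e-10:
        break

print(step, gap)
```

The table still holds an entry for the forbidden action in state 0, but `soft_value` never reads it, so the high reward attached to that action cannot leak into any state value. With a discount below 1 the backup is a contraction, which is why the gap between iterates shrinks geometrically in this toy setting.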

Figures (9)

  • Figure 1: Zero-shot experiment in the 2D navigation environment. $\hat{Q}_1^*$ in \ref{fig:q0_obst_greedy} (brighter hues indicate higher value) and its transformed version in \ref{fig:q0_constraint_(-6,-6)} (evaluated at the red dot) forbid actions that lead to obstacle collisions. Sample traces in \ref{fig:zeroshot_trajectories} (larger version in Fig. \ref{fig:larger_trajectory_figure}) show navigation towards the goal at the top, sometimes getting stuck but without colliding with the obstacle. The background in \ref{fig:zeroshot_trajectories} is colored according to discretized angles of the policy.
  • Figure 2: Offline adaptation experiment in the 2D navigation environment. Our learning algorithm adapts the pre-trained $\hat{Q}_2^*$ in \ref{fig:q1_top_greedy} to $\hat{Q}_2^{\pi_\succ}$ in \ref{fig:q1_top_adapted} (brighter hues indicate higher value), reflecting the long-term value of $r_2$ under the arbiter policy. The adapted agent has learned to drive out of and around the obstacle, as shown in \ref{fig:adapted_trajectories} (larger version in Fig. \ref{fig:larger_trajectory_figure}). The background in \ref{fig:adapted_trajectories} is colored in the same way as \ref{fig:zeroshot_trajectories}. Both online and offline adaptation improve upon the zero-shot agent considerably, as shown in \ref{fig:2d-reward-comparison}.
  • Figure 3: Baseline comparison in the 2D navigation environment. Left: Cost of the high-priority obstacle avoidance subtask during learning of the lower priority task. Right: Lower-priority navigation cost. For SAC and PPO, the scalar in the legend refers to the weight $\beta$ of the convex policy objective, where $\beta=1$ places all weight on the KL term.
  • Figure 4: High-dimensional joint control using a simulated Franka Emika Panda robot. Top left: Our prioritized agent respects constraints even in the zero-shot setting and is improved by our learning algorithm. Bottom left: A multi-objective baseline (Haarnoja et al., 2018) does not respect lexicographic priorities and incurs high cost in $r_1$ (collision) in exchange for lower cost in $r_2$ (trajectory length), moving through the forbidden part of the workspace.
  • Figure 5: A high-level overview of our method. Starting on the left, $n$ agents individually learn to solve each subtask; we refer to this as the subtask pre-training step. In the middle box, the subtask agents are combined into the lexicographic arbiter agent. The figure marks two learning views separately: the subtask adaptation loop that we implement in practice, described in Sec. \ref{sec:learning} and \ref{sec:practical-algorithm}, and the arbiter learning perspective, described in App. \ref{app:globa_view}.
  • ...and 4 more figures

Theorems & Definitions (16)

  • Theorem 3.1: Prioritized Soft Q-learning
  • proof
  • Theorem B.1: Arbiter policy evaluation
  • proof
  • Theorem B.2: Arbiter policy improvement
  • proof
  • Corollary B.3: Arbiter policy iteration
  • proof
  • Lemma F.1
  • proof
  • ...and 6 more