Prioritized Soft Q-Decomposition for Lexicographic Reinforcement Learning
Finn Rietz, Erik Schaffernicht, Stefan Heinrich, Johannes Andreas Stork
TL;DR
This work addresses lexicographic multi-objective reinforcement learning in continuous action spaces by introducing a subtask transformation and a decomposed learning framework. The method, prioritized soft Q-decomposition (PSQD), scalarizes the lexicographic problem via $Q_{\succ}(\mathbf{s},\mathbf{a}) = \sum_{i=1}^{n-1} \ln(c_i(\mathbf{s},\mathbf{a})) + Q_n(\mathbf{s},\mathbf{a})$, with transformed rewards $r_{\succ i}=\ln(c_i)$ and $r_{\succ n}=r_n$, enabling incremental learning and reuse of higher-priority subtasks. The paper provides subtask-view and arbiter-view theoretical analyses, along with a practical algorithm for continuous spaces that can perform offline adaptation using retained data. Empirical results on a 2D navigation task and a high-dimensional Franka Panda control task show that PSQD preserves lexicographic priorities in zero-shot and online adaptation, outperforming baselines that fail to respect priorities and demonstrating data-efficient reuse of subtask solutions. The method offers interpretability through decomposed subtask components and suggests a principled approach for tackling complex RL problems via modular task composition and reuse.
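To make the scalarization concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each $c_i$ is a binary indicator equal to 1 for actions whose $Q_i$ value lies within a threshold of the best available value and 0 otherwise, and it replaces the continuous action space with a small finite set of candidate actions. The function names, thresholds, and Q-values are hypothetical and chosen only for illustration.

```python
import math

def indicator(q_values, eps):
    """Assumed form of c_i: 1.0 for actions within eps of the best Q_i value, else 0.0."""
    best = max(q_values)
    return [1.0 if best - q <= eps else 0.0 for q in q_values]

def log_or_neg_inf(c):
    # ln(1) = 0 leaves the score unchanged; ln(0) is treated as -inf,
    # which rules the action out regardless of its lower-priority value.
    return math.log(c) if c > 0.0 else -math.inf

def lexicographic_q(higher_priority_qs, thresholds, q_n):
    """Q_succ(a) = sum_i ln(c_i(a)) + Q_n(a), evaluated per candidate action."""
    scores = list(q_n)
    for q_i, eps_i in zip(higher_priority_qs, thresholds):
        c_i = indicator(q_i, eps_i)
        scores = [s + log_or_neg_inf(c) for s, c in zip(scores, c_i)]
    return scores

# Made-up values for one state and four candidate actions.
q_1 = [1.0, 0.9, 0.2, 0.95]  # higher-priority subtask
q_2 = [0.1, 0.5, 2.0, 0.7]   # lowest-priority subtask, playing the role of Q_n
q_succ = lexicographic_q([q_1], [0.15], q_2)
best_action = max(range(len(q_succ)), key=lambda a: q_succ[a])
```

In this toy case, action 2 has the largest lowest-priority value but falls outside the assumed threshold for the higher-priority subtask, so its scalarized value is $-\infty$ and the greedy choice is action 3. The lower-priority subtask is thus optimized only among actions the higher-priority subtask permits.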
Abstract
Reinforcement learning (RL) for complex tasks remains a challenge, primarily due to the difficulties of engineering scalar reward functions and the inherent inefficiency of training models from scratch. Instead, it would be better to specify complex tasks in terms of elementary subtasks and to reuse subtask solutions whenever possible. In this work, we address continuous-space lexicographic multi-objective RL problems, consisting of prioritized subtasks, which are notoriously difficult to solve. We show that these can be scalarized with a subtask transformation and then solved incrementally using value decomposition. Exploiting this insight, we propose prioritized soft Q-decomposition (PSQD), a novel algorithm for learning and adapting subtask solutions under lexicographic priorities in continuous state-action spaces. PSQD offers the ability to reuse previously learned subtask solutions in a zero-shot composition, followed by an adaptation step. Its ability to use retained subtask training data for offline learning eliminates the need for new environment interaction during adaptation. We demonstrate the efficacy of our approach by presenting successful learning, reuse, and adaptation results for both low- and high-dimensional simulated robot control tasks, as well as offline learning results. In contrast to baseline approaches, PSQD does not trade off between conflicting subtasks or priority constraints and satisfies subtask priorities during learning. PSQD provides an intuitive framework for tackling complex RL problems, offering insights into the inner workings of the subtask composition.
