PA2D-MORL: Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning

Tianmeng Hu; Biao Luo

Abstract

Multi-objective reinforcement learning (MORL) provides an effective solution for decision-making problems involving conflicting objectives. However, achieving high-quality approximations of the Pareto policy set remains challenging, especially in complex tasks with continuous or high-dimensional state-action spaces. In this paper, we propose the Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning (PA2D-MORL) method, which constructs an efficient scheme for multi-objective problem decomposition and policy improvement, leading to a superior approximation of the Pareto policy set. The proposed method leverages the Pareto ascent direction to select the scalarization weights and compute the multi-objective policy gradient, which determines the policy optimization direction and ensures joint improvement on all objectives. Meanwhile, multiple policies are selectively optimized under an evolutionary framework to approximate the Pareto frontier from different directions. Additionally, a Pareto adaptive fine-tuning approach is applied to enhance the density and spread of the Pareto frontier approximation. Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes.
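To make the idea of a direction that improves all objectives at once more concrete, the following is a minimal sketch for the two-objective case. It is an assumption for illustration, not the paper's definition: it uses the standard minimum-norm point in the convex hull of the two objective gradients, whose coefficients can be read as scalarization weights. The function name `common_ascent_direction` and the toy gradients are hypothetical.

```python
import numpy as np

def common_ascent_direction(g1, g2):
    """Min-norm convex combination of two objective gradients.

    Illustrative assumption: this is the classic two-gradient min-norm
    construction, not necessarily the rule used by PA2D-MORL.
    Returns (weights, direction) with weights >= 0 summing to 1.
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        # Identical gradients: any convex weight gives the same direction.
        w = 0.5
    else:
        # Closed-form minimizer of ||w*g1 + (1-w)*g2||^2, clipped to [0, 1].
        w = float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    weights = np.array([w, 1.0 - w])
    direction = w * g1 + (1.0 - w) * g2
    return weights, direction

# Toy example with two conflicting (orthogonal) objective gradients.
g_a = np.array([1.0, 0.0])
g_b = np.array([0.0, 1.0])
weights, direction = common_ascent_direction(g_a, g_b)

# A non-negative inner product with each gradient means that a small step
# along `direction` does not decrease either objective to first order.
improves_a = float(direction @ g_a) >= 0.0
improves_b = float(direction @ g_b) >= 0.0
```

In this sketch the returned weights play the role of scalarization weights, and the direction has a non-negative inner product with both gradients, which is the first-order condition for joint improvement.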

Paper Structure

This paper contains 21 sections, 15 equations, 4 figures, 3 tables, and 1 algorithm.

Introduction
Related Work
Single-Policy Methods
Multi-Policy Methods
Preliminaries
Multi-objective Decision-making
Pareto Optimality
Method
Pareto Ascent Directional Decomposition
Pareto Ascent Direction
Partitioned Greedy Randomized Policy Selection
Pareto Adaptive Fine-tuning
Experiments
Evaluation Metrics
Simulation Environment
...and 6 more sections

Figures (4)

Figure 1: Illustration of the PGR and PA-FT methods. (a) A portion of the better performing policies in each partition is selected as candidate policies, and then one is randomly selected from the candidates. (b) The larger missing regions in the current Pareto frontier approximation are identified. Policies around the missing region are selected and fine-tuned.
Figure 2: Hypervolume and sparsity curves on the Walker environment. The light-colored parts show the standard deviation. Data are based on 6 independent runs.
Figure 3: Comparison of Pareto frontier approximations. Results of PA2D-MORL, PGMORL, and MOEA/D are shown. PA2D-MORL achieves a higher-quality policy set.
Figure 4: Comparison of Pareto frontier approximations. Results of PA2D-MORL and PA2D-ablated are shown. PA2D-MORL achieves a denser policy set.
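The figure captions refer to hypervolume, sparsity, and Pareto frontier approximations. As a rough illustration of what these quantities measure, here is a minimal sketch for two maximized objectives. The definitions are assumptions based on common usage in the MORL literature (dominated area relative to a reference point, and mean squared gap between neighboring frontier points per objective); the paper's exact formulas are not given in this excerpt, and all function names and sample points are hypothetical.

```python
import numpy as np

def pareto_front(points):
    """Return the non-dominated points (maximization in every objective)."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        dominated = any(
            np.all(q >= p) and np.any(q > p)
            for j, q in enumerate(pts)
            if j != i
        )
        if not dominated:
            keep.append(p)
    return np.array(keep)

def hypervolume_2d(points, ref):
    """Area dominated by the front and bounded below by `ref` (2 objectives)."""
    front = pareto_front(points)
    # Sort by objective 0 descending; objective 1 then increases along the front.
    front = front[np.argsort(-front[:, 0])]
    hv = 0.0
    prev_y = ref[1]
    for x, y in front:
        if x > ref[0] and y > prev_y:
            # Add the horizontal strip not yet covered by earlier points.
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def sparsity(points):
    """Mean squared gap between neighboring front points, summed over objectives."""
    front = pareto_front(points)
    if len(front) < 2:
        return 0.0
    total = 0.0
    for j in range(front.shape[1]):
        vals = np.sort(front[:, j])
        total += float(np.sum(np.diff(vals) ** 2))
    return total / (len(front) - 1)

# Hypothetical returns of four policies; (1, 1) is dominated by (2, 2).
sample = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0), (1.0, 1.0)]
front = pareto_front(sample)
hv = hypervolume_2d(sample, ref=(0.0, 0.0))
sp = sparsity(sample)
```

Under these assumed definitions, a larger hypervolume indicates a frontier that reaches further from the reference point, and a lower sparsity indicates a denser frontier.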
