Latent-Conditioned Policy Gradient for Multi-Objective Deep Reinforcement Learning

Takuya Kanazawa; Chetan Gupta

TL;DR

A novel multi-objective reinforcement learning (MORL) algorithm that trains a single neural network via policy gradient to approximately obtain the entire Pareto set in a single run of training, without relying on linear scalarization of objectives.

Abstract

Sequential decision making in the real world often requires finding a good balance of conflicting objectives. In general, there exist a plethora of Pareto-optimal policies that embody different patterns of compromises between objectives, and it is technically challenging to obtain them exhaustively using deep neural networks. In this work, we propose a novel multi-objective reinforcement learning (MORL) algorithm that trains a single neural network via policy gradient to approximately obtain the entire Pareto set in a single run of training, without relying on linear scalarization of objectives. The proposed method works in both continuous and discrete action spaces with no design change of the policy network. Numerical experiments in benchmark environments demonstrate the practicality and efficacy of our approach in comparison to standard MORL baselines.
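To make the phrase "without relying on linear scalarization" concrete, the following toy sketch (not taken from the paper) shows why a weighted sum of objectives can miss Pareto-optimal solutions that lie in a concave part of the front. The return vectors in `toy_returns` and the helper names `pareto_mask` and `scalarization_winners` are hypothetical, chosen only for illustration; the proposed LC-MOPG algorithm itself is not implemented here.

```python
import numpy as np


def pareto_mask(returns):
    """Boolean mask of rows not dominated by any other row (maximization)."""
    returns = np.asarray(returns, dtype=float)
    mask = np.ones(len(returns), dtype=bool)
    for i, r in enumerate(returns):
        # r is dominated if some row is >= r in every objective
        # and strictly > r in at least one objective.
        dominated = np.all(returns >= r, axis=1) & np.any(returns > r, axis=1)
        mask[i] = not dominated.any()
    return mask


def scalarization_winners(returns, num_weights=101):
    """Indices that maximize w*R1 + (1-w)*R2 for some w on a grid in [0, 1]."""
    returns = np.asarray(returns, dtype=float)
    winners = set()
    for w in np.linspace(0.0, 1.0, num_weights):
        scores = returns @ np.array([w, 1.0 - w])
        winners.add(int(np.argmax(scores)))
    return sorted(winners)


# Hypothetical bi-objective returns of four candidate policies.
toy_returns = np.array([
    [1.0, 0.0],
    [0.4, 0.4],  # Pareto-optimal, but in a concave region of the front
    [0.0, 1.0],
    [0.2, 0.2],  # dominated by [0.4, 0.4]
])

pareto_indices = np.flatnonzero(pareto_mask(toy_returns)).tolist()
linear_indices = scalarization_winners(toy_returns)
print("Pareto-optimal:", pareto_indices)
print("Found by linear scalarization:", linear_indices)
```

In this toy setting the policy with returns (0.4, 0.4) is Pareto-optimal, yet no weight on the grid selects it, because one of the two extreme policies always scores at least 0.5 under the weighted sum.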

Paper Structure (37 sections, 10 equations, 16 figures, 7 tables, 4 algorithms)

Introduction
Related work
Scalarization-based MORL
MORL without explicit scalarization
Implicit generative networks
Problem formulation
Markov decision processes
Policy gradient
Multi-objective MDP
Methodology
Normalization of returns
Scoring of returns
Bonus computation
Clipping
Environments
...and 22 more sections

Figures (16)

Figure 1: Illustration of the proposed algorithm (LC-MOPG) for a bi-objective problem. There is an alternative that uses value networks in addition to the policy network (LC-MOPG-V).
Figure 2: Architecture of the policy network.
Figure 3: The DST environment. Orange cells are treasures and blue cells are the ocean.
Figure 4: Minecart environment. The cart departs from the top-left corner, goes for mining at any of the 5 mines (blue circles), and returns home (red quarter circle) to sell ores.
Figure 5: Exact PF of the DST environment. The vertical axis is the cumulative time cost, and the horizontal axis is the treasure reward. Left: For convex treasure values and $\gamma=0.99$. Right: For original treasure values and $\gamma=1.0$. These discount rates were chosen to ensure a fair comparison with related work.
...and 11 more figures
