Domains as Objectives: Domain-Uncertainty-Aware Policy Optimization through Explicit Multi-Domain Convex Coverage Set Learning

Wendyam Eric Lionel Ilboudo; Taisuke Kobayashi; Takamitsu Matsubara

TL;DR

Efficiently optimizing uncertainty-aware policies can be reframed as solving the convex coverage set (CCS) problem in a multi-objective reinforcement learning (MORL) setting. The paper proposes a series of algorithms adapted from the MORL literature to solve for the CCS and demonstrates that they improve the performance of uncertainty-aware policies.

Abstract

Uncertainty is an inherent feature of real-world robotics problems, and any control framework must contend with it in order to succeed in real applications. Reinforcement learning is no different: epistemic uncertainty arising from model uncertainty or misspecification is a challenge well captured by the sim-to-real gap. A simple solution to this issue is domain randomization (DR), which unfortunately can result in conservative agents. As a remedy to this conservativeness, universal policies that take additional information about the randomized domain as input have emerged as an alternative, along with recurrent neural network-based controllers. Uncertainty-aware universal policies are a particularly compelling solution because they can account for system identification uncertainties during deployment. In this paper, we show that the challenge of efficiently optimizing uncertainty-aware policies can be reframed as solving the convex coverage set (CCS) problem within a multi-objective reinforcement learning (MORL) context. By introducing a novel Markov decision process (MDP) framework in which each domain's performance is treated as an independent objective, we unify the training of uncertainty-aware policies with MORL approaches. This connection enables the application of MORL algorithms to DR, allowing for more efficient policy optimization. To illustrate this, we focus on the linear utility function, which aligns with the expectation in DR formulations, and propose a series of algorithms adapted from the MORL literature to solve for the CCS, demonstrating their ability to enhance the performance of uncertainty-aware policies.
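
To make the link between the linear utility and the CCS concrete, the following is a minimal sketch, not the paper's algorithm. It assumes two domains and a small, made-up table of per-domain expected returns for six hypothetical policies. A belief over the two domains acts as a weight vector, the DR objective for that belief is the weighted sum of per-domain returns, and a policy belongs to the (approximate) CCS if it maximizes that sum for at least one belief. The function name `approximate_ccs`, the grid over weights, and all numbers are illustrative assumptions.

```python
import numpy as np


def approximate_ccs(returns, num_weights=101):
    """Approximate the CCS of a finite set of policies over two domains.

    returns: array-like of shape (n_policies, 2); entry [i, d] is the
             expected return of policy i in domain d (hypothetical values).
    The result is the sorted list of policy indices that are optimal for
    at least one weight vector on a uniform grid over the 2-simplex.
    """
    returns = np.asarray(returns, dtype=float)
    # Beliefs over two domains: w = (p, 1 - p) with p in [0, 1].
    p = np.linspace(0.0, 1.0, num_weights)
    weights = np.stack([p, 1.0 - p], axis=1)   # (num_weights, 2)
    # Linear utility: expected return under each belief, for each policy.
    scalarized = weights @ returns.T           # (num_weights, n_policies)
    best = np.argmax(scalarized, axis=1)       # best policy per belief
    return sorted(set(best.tolist()))


# Hypothetical per-domain returns (domain 1, domain 2) for six policies.
returns = [
    [10.0, 0.0],  # 0: specialist for domain 1
    [0.0, 10.0],  # 1: specialist for domain 2
    [6.0, 6.0],   # 2: balanced policy
    [4.0, 4.0],   # 3: dominated by policy 2
    [3.0, 9.0],   # 4: leans toward domain 2
    [8.0, 2.5],   # 5: Pareto-optimal, but never best under a linear utility
]

ccs_indices = approximate_ccs(returns)
print(ccs_indices)  # [0, 1, 2, 4]
```

In this toy example, policy 5 is not dominated by any single policy, so it lies in the Pareto coverage set, yet no belief over the two domains makes it the maximizer of the expected return, so it is excluded from the CCS. The grid over weights is only an approximation; an exact test would check, for each policy, whether some weight vector on the simplex makes it optimal.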

Paper Structure (43 sections, 45 equations, 11 figures, 16 tables, 1 algorithm)

Introduction
Background
Domain Randomization
Reinforcement Learning:
Domain Randomization:
Multi-Objective Reinforcement Learning and Convex Coverage Sets
Multi-Objective Reinforcement Learning:
Convex Coverage Sets:
Pseudo-Multi-Objective Problem for Convex Coverage Set Optimization
Multi-Domain Reinforcement Learning as a Pseudo-Multi-Objective Problem
Multi-Domain Uncertainty-Aware CCS Optimization
Conditioned MDRL:
MDRL with CCS optimality filter (Envelope MDRL):
Utopia-based MDRL:
Qualitative Analysis of the proposed algorithms
...and 28 more sections

Figures (11)

Figure 1: Domain randomization sacrifices the overall performance in order to produce policies which generalize to a wide range of conditions, leading to conservative behaviors. The bigger the randomized domain parameter space is, the more conservative the policy needs to be.
Figure 2: Visualization of a Pareto Coverage Set (PCS) and its Convex Coverage Set (CCS)
Figure 3: Visualization of the piecewise-linear and convex (PWLC) nature of $V^{*}(\boldsymbol{\omega})$
Figure 5: Analogy between multi-objective RL and domain randomization as a multi-domain RL problem.
Figure 6: The options available for learning the CCS, illustrated for the case of two discrete uncertainties or preferences $\boldsymbol{\varpi}$ and $\boldsymbol{\varpi}'$: (a) Learn two different policies, one for each uncertainty vector, as in the sMORL approach. (b) Learn a single function that takes the uncertainty vector as input and generates the corresponding optimal policy. This function of $\boldsymbol{\varpi}$ is known as a Universal Policy (UP).
...and 6 more figures

Theorems & Definitions (2)

Definition 1: Pseudo-MOMDP or PMOMDP
Definition 2: ($\alpha$-th order Unscented Transform)
