Pareto Inverse Reinforcement Learning for Diverse Expert Policy Generation

Woo Kyung Kim, Minjong Yoo, Honguk Woo

TL;DR

This work tackles the challenge of deriving diverse Pareto-optimal policies from strictly limited expert data in multi-objective decision tasks. It introduces ParIRL, a two-phase framework that first grows a dense Pareto front via recursive, reward-distance-regularized IRL with EPIC-based distance constraints, and then distills the front into a single preference-conditioned diffusion model for zero-shot customization. Theoretical regret bounds tie learning performance to reward-distance metrics and trajectory-distribution changes, while experiments across MO-Car, MuJoCo variants, and CARLA show denser Pareto frontiers with better coverage than strong baselines. The approach enables efficient, adaptable deployment of diverse expert-like policies without requiring reward signals from the environment or fully labeled preference data.

Abstract

Data-driven offline reinforcement learning and imitation learning approaches have been gaining popularity in addressing sequential decision-making problems. Yet, these approaches rarely consider learning Pareto-optimal policies from a limited pool of expert datasets. This gap is particularly marked given the practical limitations in obtaining comprehensive datasets for all preferences, where multiple conflicting objectives exist and each expert might hold a unique optimization preference for these objectives. In this paper, we adapt inverse reinforcement learning (IRL) by using reward distance estimates to regularize the discriminator. This enables progressive generation of a set of policies that accommodate diverse preferences over the multiple objectives, while using only two distinct datasets, each associated with a different expert preference. In doing so, we present a Pareto IRL framework (ParIRL) that establishes a Pareto policy set from these limited datasets. In the framework, the Pareto policy set is then distilled into a single, preference-conditioned diffusion model, thus allowing users to immediately specify which expert's patterns they prefer. Through experiments, we show that ParIRL outperforms other IRL algorithms on various multi-objective control tasks, achieving a dense approximation of the Pareto frontier. We also demonstrate the applicability of ParIRL to autonomous driving in CARLA.
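
To make the notion of a "reward distance" concrete, the following is a minimal sketch of the Pearson distance that EPIC-style metrics are built on, computed over reward values sampled on a shared batch of transitions. It assumes the rewards have already been evaluated on the same transitions and it omits EPIC's canonicalization step, so it is not the regularizer used in ParIRL. The function name `pearson_distance` is a hypothetical helper introduced only for illustration.

```python
import math


def pearson_distance(r_a, r_b):
    """Pearson distance sqrt((1 - rho) / 2) between two reward samples.

    r_a and r_b are reward values of two reward functions evaluated on the
    same transitions. The result lies in [0, 1] and is unchanged when either
    input is shifted by a constant or scaled by a positive factor.
    Illustrative only: EPIC's canonicalization step is not applied here.
    """
    n = len(r_a)
    if n != len(r_b) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 entries")

    mean_a = sum(r_a) / n
    mean_b = sum(r_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(r_a, r_b))
    var_a = sum((a - mean_a) ** 2 for a in r_a)
    var_b = sum((b - mean_b) ** 2 for b in r_b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("Pearson correlation is undefined for constant rewards")

    rho = cov / math.sqrt(var_a * var_b)
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    return math.sqrt((1.0 - rho) / 2.0)


if __name__ == "__main__":
    rewards = [0.0, 1.0, 2.0, 5.0]
    rescaled = [3.0 * r + 7.0 for r in rewards]  # same preferences, new scale
    flipped = [-r for r in rewards]              # opposite preferences
    print(pearson_distance(rewards, rescaled))
    print(pearson_distance(rewards, flipped))
```

In this sketch, a rescaled copy of a reward yields a distance near 0 and a negated copy yields a distance near 1, which is the property that makes such a distance usable as a target when spacing learned rewards between two experts.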

Paper Structure (37 sections, 2 theorems, 40 equations, 7 figures, 10 tables, 1 algorithm)

Key Result

Lemma 1

Let $\mathcal{D}_1, ..., \mathcal{D}_m$ be arbitrary distributions over transitions $S \times A \times S$. For $\alpha \geq m$ and $i \in \{1,...,m\}$, where $W_\alpha$ is relaxed Wasserstein distance (Definition A.13 in rw:epic).

Figures (7)

  • Figure 1: Data-driven Pareto policy set learning
  • Figure 2: Concept of Pareto policy set generation: given two distinct expert datasets, each associated with a specific preference over multi-objectives (e.g., some expert prefers speed over energy efficiency, and vice versa), the Pareto IRL is to find a set of optimal compromise custom policies, each of which can conform to a different preference.
  • Figure 3: $\text{ParIRL}$ framework: In (i), policies in a Pareto set are recursively derived via reward distance regularized IRL. In (ii), the preference-conditioned diffusion model enhances the approximated Pareto policy set via distillation.
  • Figure 4: Pareto policy set $\Pi$ visualization and learning curve
  • Figure 5: Visualization of agents obtained via $\text{ParIRL}$: the red rectangle denotes learned agent, the dotted lines denote lane changes, the dotted circles denote the vehicles overtaken by our agent.
  • ...and 2 more figures
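
As an illustration of what a Pareto policy set means in the captions above, here is a minimal sketch that filters a list of per-policy return vectors down to the non-dominated ones, assuming every objective is to be maximized. The helper names `dominates` and `pareto_front` and the example numbers are hypothetical and are not taken from the paper.

```python
def dominates(u, v):
    """True if return vector u is at least as good as v on every objective
    and strictly better on at least one (all objectives maximized)."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(points):
    """Keep only the return vectors that no other vector dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical (speed, energy-efficiency) returns of four policies.
    returns = [(1.0, 5.0), (3.0, 3.0), (5.0, 1.0), (2.0, 2.0)]
    print(pareto_front(returns))
```

Here the policy with returns (2.0, 2.0) is dropped because (3.0, 3.0) is better on both objectives, while the other three each represent a different trade-off and remain in the set.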

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Theorem 1
  • proof