Inferring Preferences from Demonstrations in Multi-objective Reinforcement Learning

Junlin Lu, Patrick Mannion, Karl Mason

TL;DR

This work addresses the problem of inferring human or agent preferences in multi-objective reinforcement learning (MORL) from demonstrations, without querying the user. It introduces DWPI, a regression-based preference inference framework trained on data from a dynamically weighted MORL data generator (DWMORL) that can produce both optimal and sub-optimal demonstrations. On three MORL benchmarks, DWPI achieves better time efficiency and inference accuracy than two existing preference inference algorithms, and it remains robust when the demonstrations are sub-optimal. The results suggest that DWPI can reliably recover preference vectors from demonstrations using only a fast forward pass at inference time. The work also provides a formal mapping between demonstrations and preferences, a correctness proof, and an analysis of complexity and of demonstration representations. Code is released for reproducibility, and potential extensions include multi-agent settings and non-linear utilities.

Abstract

Many decision-making problems feature multiple objectives, and it is not always possible to know the preferences of a human or agent decision-maker over those objectives. However, demonstrated behaviors from the decision-maker are often available. This research proposes a dynamic weight-based preference inference (DWPI) algorithm that can infer the preferences of agents acting in multi-objective decision-making problems from demonstrations. The proposed algorithm is evaluated on three multi-objective Markov decision processes (Deep Sea Treasure, Traffic, and Item Gathering) and is compared to two existing preference inference algorithms. Empirical results demonstrate significant improvements over the baseline algorithms in terms of both time efficiency and inference accuracy. The DWPI algorithm maintains its performance when inferring preferences from sub-optimal demonstrations. Moreover, the DWPI algorithm does not require any interaction with the user during inference; only demonstrations are required. We provide a correctness proof and complexity analysis of the algorithm and statistically evaluate its performance under different representations of demonstrations.
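
The idea of treating preference inference as regression from demonstrations to weight vectors can be illustrated with a toy sketch. This is not the authors' implementation: the return vectors, the linear scalarisation, the grid of sampled weights, and the function names (`generate_dataset`, `fit_mean_regressor`, `infer_preference`) are all assumptions made for illustration, and the "regressor" is simply the mean weight per distinct demonstration rather than a learned model.

```python
import numpy as np

# Hypothetical two-objective returns (treasure value, time penalty) of four
# candidate policies. Illustrative numbers only, not taken from the paper.
POLICY_RETURNS = np.array([[1.0, -1.0], [8.0, -3.0], [16.0, -7.0], [24.0, -13.0]])


def generate_dataset(n_weights=101):
    """Pair each sampled weight vector with the return of the policy it selects.

    Assumes a linear utility u = w . r; the selected policy stands in for a
    demonstration produced by an agent acting under preference w.
    """
    w1 = np.linspace(0.0, 1.0, n_weights)
    weights = np.stack([w1, 1.0 - w1], axis=1)
    best = np.argmax(weights @ POLICY_RETURNS.T, axis=1)
    return POLICY_RETURNS[best], weights


def fit_mean_regressor(demos, weights):
    """Fit the squared-error-optimal predictor for a finite set of inputs:
    the mean weight vector observed for each distinct demonstration."""
    keys, inverse = np.unique(demos, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    means = np.stack([weights[inverse == k].mean(axis=0) for k in range(len(keys))])
    return keys, means


def infer_preference(demo, keys, means):
    """Forward inference: map a (possibly noisy) demonstration to a weight vector
    by looking up the closest demonstration seen during fitting."""
    idx = np.argmin(np.linalg.norm(keys - np.asarray(demo, dtype=float), axis=1))
    return means[idx]


demos, weights = generate_dataset()
keys, means = fit_mean_regressor(demos, weights)

# A slightly perturbed return vector, standing in for a sub-optimal demonstration.
inferred = infer_preference([16.1, -7.2], keys, means)
print("inferred preference:", inferred)
```

Once the mapping is fitted, inference is a single lookup with no user queries, which is the property the abstract emphasises. A demonstration that corresponds to a whole range of weights can only be mapped to a representative weight in that range, so the recovered preference is an estimate rather than an exact value.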

Paper Structure (40 sections, 11 equations, 8 figures, 2 tables, 4 algorithms)

This paper contains 40 sections, 11 equations, 8 figures, 2 tables, 4 algorithms.

Figures (8)

  • Figure 1: Train the DWPI model
  • Figure 2: CDST Environment (left): Agent in blue, treasures in yellow with numbers, walkable grids in light blue, unwalkable grids in black. Traffic Environment (middle): Agent in blue, item to collect in green, cars in red, roads in yellow, and walls in white. Item Gathering Environment (right): Agent in blue, fixed-preference agent in pink, three categories of collectable items in green, red, and yellow. A fixed number of items from each category is randomly placed in the environment at the start of each episode.
  • Figure 3: Time Efficiency Comparison
  • Figure 4: Inference Accuracy Comparison
  • Figure 5: Inference Result CDST - Stochastic Demonstration Ratio - 50%
  • ...and 3 more figures

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4