CRED: Counterfactual Reasoning and Environment Design for Active Preference Learning

Yi-Shiuan Tung, Gyanig Kumar, Wei Jiang, Bradley Hayes, Alessandro Roncone

TL;DR

CRED is a novel trajectory generation method for active preference learning (APL) that improves reward inference by jointly optimizing environment design and trajectory selection, so that user preferences can be queried and extracted efficiently.

Abstract

As a robot's operational environment and the tasks it performs there grow in complexity, explicitly specifying and balancing optimization objectives to achieve a preferred behavior profile moves increasingly out of reach. These systems benefit strongly from being able to align their behavior with human preferences and respond to corrections, but manually encoding this feedback is infeasible. Active preference learning (APL) learns human reward functions by presenting trajectories for ranking. However, existing methods sample from fixed trajectory sets or replay buffers, which limits query diversity, and they often fail to identify informative comparisons. We propose CRED, a novel trajectory generation method for APL that improves reward inference by jointly optimizing environment design and trajectory selection to efficiently query and extract preferences from users. CRED "imagines" new scenarios through environment design and leverages counterfactual reasoning -- by sampling possible rewards from its current belief and asking "What if this were the true preference?" -- to generate trajectory pairs that expose differences between competing reward functions. Comprehensive experiments and a user study show that CRED significantly outperforms state-of-the-art methods in reward accuracy and sample efficiency and receives higher user ratings.
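
The counterfactual step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the linear reward over trajectory features, the two-dimensional feature vectors, and the `disagreement` score used to rank pairs are all assumptions made for illustration, and every function name is hypothetical. The idea shown is only the one stated above: sample rewards from the belief, ask which trajectory would be optimal if each sample were the true preference, and query the pair that best separates the competing rewards.

```python
import itertools

def reward(weights, features):
    """Linear reward w . phi(trajectory); the linear form is an assumption."""
    return sum(w * f for w, f in zip(weights, features))

def counterfactual_candidates(belief_samples, trajectories):
    """For each sampled reward, ask "what if this were the true preference?"
    and keep the trajectory that would be optimal under it (no duplicates)."""
    chosen = []
    for w in belief_samples:
        best = max(trajectories, key=lambda feats: reward(w, feats))
        if best not in chosen:
            chosen.append(best)
    return chosen

def disagreement(pair, belief_samples):
    """Hypothetical informativeness proxy in [0, 1]:
    1.0 when the sampled rewards split evenly over which trajectory is better."""
    a, b = pair
    prefers_a = sum(reward(w, a) > reward(w, b) for w in belief_samples)
    p = prefers_a / len(belief_samples)
    return 1.0 - abs(2.0 * p - 1.0)

def select_query(belief_samples, trajectories):
    """Return the candidate pair the current belief disagrees about most."""
    candidates = counterfactual_candidates(belief_samples, trajectories)
    if len(candidates) < 2:
        return None  # every sampled reward picks the same trajectory
    return max(itertools.combinations(candidates, 2),
               key=lambda pair: disagreement(pair, belief_samples))

# Toy data (assumed): features are (-travel_time, -rough_terrain).
trajectories = [(-1.0, -3.0), (-3.0, -0.5), (-2.0, -2.0)]
# Belief samples: weights on (travel time, terrain).
belief_samples = [(1.0, 0.1), (0.1, 1.0), (0.9, 0.2), (0.2, 0.9)]

query = select_query(belief_samples, trajectories)
print(query)  # ((-1.0, -3.0), (-3.0, -0.5))
```

In this toy case the time-focused samples favor the fast, rough trajectory and the terrain-focused samples favor the slow, smooth one, so the selected pair is exactly the comparison whose answer would distinguish the two groups of hypotheses.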

Paper Structure (17 sections, 4 equations, 5 figures, 2 algorithms)

Figures (5)

  • Figure 1: The delivery robot above (goal: yellow pin) optimizes its path by considering factors such as travel time and terrain type. Through active preference learning, it infers human rewards from trajectory rankings. (a) However, current state-of-the-art methods often struggle to efficiently generate informative trajectory pairs for these queries, leading to suboptimal results. To overcome this, CRED makes two key contributions: (b) counterfactual reasoning, which explores varied hypothetical preferences to produce more diverse trajectories, and (c) environment design, which "imagines" different scenarios--e.g., altering terrain from grass to gravel--to enhance the system's generalization capabilities. (d) As a result, the robot aligns with human preferences when deployed.
  • Figure 2: System overview of CRED as a bilevel optimization problem. Outer optimization (Environment Design): Bayesian optimization selects environment parameters $\theta$ to evaluate, seeking those that maximize estimates of query informativeness $F$. Inner optimization (Counterfactual Reasoning): Given $\theta$, the system samples reward weights from the current belief to generate candidate trajectories $\{\xi_1, \dots, \xi_M\}$. The most informative trajectory pair is returned to the outer optimization.
  • Figure 3: Examples of preference queries with details and visualization across the three simulation domains.
  • Figure 4: Accuracy of the estimated rewards across three domains (columns) for each experiment (rows).
  • Figure 5: Results from the user study: higher is better for (a)-(c), lower is better for (d). Within each plot the central line denotes the median, the lower and upper box edges correspond to the first (Q1) and third (Q3) quartiles, respectively, and the whiskers extend 1.5 times the interquartile range beyond Q1 and Q3. Statistically significant differences are indicated as: * $p \leq 0.05$, ** $p \leq 0.01$.
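
The bilevel structure described in the Figure 2 caption can be sketched in a few lines. This is a rough illustration under stated assumptions, not the paper's method: the outer loop below uses plain random search in place of Bayesian optimization, the two-route toy environment and its parameters `theta = (detour_time, gravel)` are hypothetical, the reward is assumed linear in trajectory features, and the score standing in for the informativeness estimate $F$ is a simple disagreement measure over sampled reward weights.

```python
import random

def reward(weights, features):
    """Linear reward w . phi(trajectory); the linear form is an assumption."""
    return sum(w * f for w, f in zip(weights, features))

def make_trajectories(theta):
    """Hypothetical toy environment with two routes.
    Features are (-travel_time, -gravel_crossed)."""
    detour_time, gravel = theta
    direct = (-1.0, -gravel)       # short, but crosses gravel
    detour = (-detour_time, 0.0)   # longer, avoids gravel
    return [direct, detour]

def inner_best_pair(theta, belief_samples):
    """Inner level: given theta, build candidates from sampled rewards and
    return the most informative pair with its score (a stand-in for F)."""
    trajectories = make_trajectories(theta)
    candidates = []
    for w in belief_samples:
        best = max(trajectories, key=lambda t: reward(w, t))
        if best not in candidates:
            candidates.append(best)
    best_pair, best_score = None, 0.0
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            a, b = candidates[i], candidates[j]
            p = sum(reward(w, a) > reward(w, b)
                    for w in belief_samples) / len(belief_samples)
            score = 1.0 - abs(2.0 * p - 1.0)  # 1.0 = samples split evenly
            if score > best_score:
                best_pair, best_score = (a, b), score
    return best_pair, best_score

def outer_search(belief_samples, n_trials=200, seed=0):
    """Outer level: random search over theta (swapped in for Bayesian
    optimization) that keeps the environment with the highest score."""
    rng = random.Random(seed)
    best_theta, best_pair, best_score = None, None, -1.0
    for _ in range(n_trials):
        theta = (rng.uniform(1.0, 3.0), rng.uniform(0.0, 2.0))
        pair, score = inner_best_pair(theta, belief_samples)
        if score > best_score:
            best_theta, best_pair, best_score = theta, pair, score
    return best_theta, best_pair, best_score

# Belief samples (assumed): weights on (travel time, gravel).
belief_samples = [(1.0, 0.2), (1.0, 0.5), (1.0, 1.0), (1.0, 2.0)]
theta, pair, score = outer_search(belief_samples)
```

The point of the sketch is the division of labor: an environment in which every sampled reward picks the same route yields no informative query, so the outer search is pushed toward environments where the hypotheses disagree. A real implementation would replace the random search with a Bayesian optimizer and the disagreement score with the informativeness estimate the paper defines.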