Explaining Learned Reward Functions with Counterfactual Trajectories

Jan Wehner, Frans Oliehoek, Luciano Cavalcante Siebert

TL;DR

Counterfactual Trajectory Explanations (CTEs) are proposed to interpret reward functions in reinforcement learning by contrasting an original with a counterfactual partial trajectory and the rewards they each receive.

Abstract

Learning rewards from human behaviour or feedback is a promising approach to aligning AI systems with human values but fails to consistently extract correct reward functions. Interpretability tools could enable users to understand and evaluate possible flaws in learned reward functions. We propose Counterfactual Trajectory Explanations (CTEs) to interpret reward functions in reinforcement learning by contrasting an original with a counterfactual partial trajectory and the rewards they each receive. We derive six quality criteria for CTEs and propose a novel Monte-Carlo-based algorithm for generating CTEs that optimises these quality criteria. Finally, we measure how informative the generated explanations are to a proxy-human model by training it on CTEs. CTEs are demonstrably informative for the proxy-human model, increasing the similarity between its predictions and the reward function on unseen trajectories. Further, it learns to accurately judge differences in rewards between trajectories and generalises to out-of-distribution examples. Although CTEs do not lead to a perfect understanding of the reward, our method, and more generally the adaptation of XAI methods, are presented as a fruitful approach for interpreting learned reward functions.
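
As a rough illustration of what a single CTE contains, the sketch below pairs an original partial trajectory with a counterfactual one and records the reward each receives. It is not the authors' implementation: the grid-position states, the `CTE` class, the helper functions and the toy reward are all hypothetical names chosen for this example, and the numbers merely mirror the $+4$ versus $+2$ case described in Figure 1.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[int, int]      # hypothetical state: a grid position (x, y)
Trajectory = List[State]     # a partial trajectory is a sequence of states


@dataclass
class CTE:
    """One explanation: two partial trajectories and the reward of each."""
    original: Trajectory
    counterfactual: Trajectory
    reward_original: float
    reward_counterfactual: float

    @property
    def reward_difference(self) -> float:
        # Positive when the original trajectory is rewarded more.
        return self.reward_original - self.reward_counterfactual


def trajectory_reward(reward_fn: Callable[[State], float],
                      trajectory: Trajectory) -> float:
    # Assumption: the reward of a trajectory is the sum of per-state rewards.
    return sum(reward_fn(state) for state in trajectory)


def make_cte(reward_fn: Callable[[State], float],
             original: Trajectory,
             counterfactual: Trajectory) -> CTE:
    return CTE(
        original=original,
        counterfactual=counterfactual,
        reward_original=trajectory_reward(reward_fn, original),
        reward_counterfactual=trajectory_reward(reward_fn, counterfactual),
    )


def toy_reward(state: State) -> float:
    # Toy stand-in for a learned reward: +1 for staying in lane y == 0.
    return 1.0 if state[1] == 0 else 0.0


original = [(0, 0), (1, 0), (2, 0), (3, 0)]         # straight line
counterfactual = [(0, 0), (1, 1), (2, 1), (3, 0)]   # swerves out of the lane
cte = make_cte(toy_reward, original, counterfactual)
print(cte.reward_original, cte.reward_counterfactual, cte.reward_difference)
```

Here the original trajectory receives 4.0 and the counterfactual 2.0, so a user shown this pair might hypothesise that the reward function favours staying in the lane. Generating counterfactuals that satisfy the paper's quality criteria is a separate problem that this sketch does not attempt.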

Paper Structure (52 sections, 18 figures, 8 tables, 2 algorithms)

Figures (18)

  • Figure 1: A car originally took a straight line and received a reward of $+4$ from the reward function. Seeing a counterfactual that receives a lower reward of $+2$ lets the user form hypotheses about how the reward function assigns rewards.
  • Figure 2: Schematic describing how rewards are learned (1), how explanations are generated (2), and how they are evaluated (3, 4 and 5).
  • Figure 3: The average informativeness of CTEs generated by MCTO, DaC and Random for a NN trained for single and contrastive predictions, along with median, upper and lower quartile.
  • Figure 4: Spearman correlation between weights for the quality criteria and the informativeness of the resulting CTEs for $M_\phi$ for the contrastive and single task. Averaged over 10 models along with the median and upper and lower quartile.
  • Figure 5: A random initialisation of the Emergency environment by Peschl et al. (2022). Shows the player P in blue, the humans C in green, the obstacles H in brown, the fire-extinguisher G in pink and the borders of the environment # in brown.
  • ...and 13 more figures

Theorems & Definitions (1)

  • Definition 2.1