Distributional Inverse Reinforcement Learning

Feiyang Wu, Ye Zhao, Anqi Wu

TL;DR

DistIRL addresses the limitation of point-valued rewards in offline IRL by learning both the reward distribution $q_\rho(r|s,a)$ and the full return distribution $Z^\pi$, guided by first-order stochastic dominance (FSD) and distortion risk measures (DRMs). The framework combines energy-based Bayesian reward learning with distributional RL techniques to produce distribution-aware policies without environment interaction. Empirical results across gridworld, neuroscience data, and MuJoCo demonstrate accurate recovery of reward shapes and state-of-the-art imitation performance under risk-sensitive objectives. This yields robust, risk-aware imitation capabilities suitable for behavior analysis and neuroscience applications, with broad applicability to offline scenarios where rewards are stochastic.

Abstract

We propose a distributional framework for offline Inverse Reinforcement Learning (IRL) that jointly models uncertainty over reward functions and the full distribution of returns. Conventional IRL approaches recover a deterministic reward estimate or match only expected returns. Our method instead captures richer structure in expert behavior, in particular the reward distribution, by minimizing first-order stochastic dominance (FSD) violations, which integrates distortion risk measures (DRMs) into policy learning and enables the recovery of both reward distributions and distribution-aware policies. This formulation is well suited to behavior analysis and risk-aware imitation learning. Empirical results on synthetic benchmarks, real-world neurobehavioral data, and MuJoCo control tasks show that our method recovers expressive reward representations and achieves state-of-the-art imitation performance.
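To make the idea of an "FSD violation" concrete, here is a minimal sketch that scores how far one empirical sample is from dominating another, using quantile functions. The abstract does not give the loss the authors use, so the hinge-on-quantile-gap form, the tau grid, and the names `empirical_quantiles` and `fsd_violation` are assumptions chosen for illustration.

```python
import numpy as np

def empirical_quantiles(samples, taus):
    """Empirical quantile function F^{-1}(tau) evaluated at each tau."""
    return np.quantile(np.asarray(samples, dtype=float), taus)

def fsd_violation(x_samples, y_samples, n_taus=99):
    """Average amount by which X fails to first-order dominate Y.

    X dominates Y when F_X^{-1}(tau) >= F_Y^{-1}(tau) for every tau, so the
    hinge max(0, F_Y^{-1}(tau) - F_X^{-1}(tau)) is zero on the grid exactly
    when no violation is observed there. (Assumed form, not the paper's loss.)
    """
    taus = np.linspace(0.01, 0.99, n_taus)
    gap = empirical_quantiles(y_samples, taus) - empirical_quantiles(x_samples, taus)
    return float(np.mean(np.maximum(gap, 0.0)))

rng = np.random.default_rng(0)
base = rng.normal(0.0, 1.0, size=2000)
shifted = base + 1.0  # same shape, shifted right, so it dominates `base`

violation_dominating = fsd_violation(shifted, base)  # no violation expected
violation_dominated = fsd_violation(base, shifted)   # every quantile is violated
```

In this toy case the shifted sample dominates the original at every quantile, so the first score vanishes, while the reverse comparison is penalized by roughly the size of the shift.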

Paper Structure

This paper contains 22 sections, 5 theorems, 30 equations, 12 figures, 6 tables, 1 algorithm.

Key Result

Proposition 4.2

For real‐valued $X$ and $Y$, the following are equivalent:
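The list of equivalent conditions is cut off on this page. For orientation only, the standard textbook characterizations of first-order stochastic dominance are sketched below; the notation ($F_X$ for the CDF, $F_X^{-1}$ for the quantile function, $u$ for a utility function) is an assumption and may not match the wording or ordering of the paper's statement.

```latex
\begin{align*}
X \succeq_{\mathrm{FSD}} Y
  &\iff F_X(z) \le F_Y(z) \quad \text{for all } z \in \mathbb{R} \\
  &\iff F_X^{-1}(\tau) \ge F_Y^{-1}(\tau) \quad \text{for all } \tau \in (0,1) \\
  &\iff \mathbb{E}[u(X)] \ge \mathbb{E}[u(Y)] \quad \text{for all nondecreasing } u
        \text{ with finite expectations.}
\end{align*}
```

Taking $u(x) = x$ in the last condition gives $\mathbb{E}[X] \ge \mathbb{E}[Y]$, which is presumably the content of the mean-dominance corollary listed among the paper's results.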

Figures (12)

  • Figure 1: Illustration of quantile functions and first-order stochastic dominance (FSD).
  • Figure 2: Inferring reward mean and variance in the gridworld example with $10$ demonstrations.
  • Figure 3: Learned reward distribution versus recorded dopamine signals and their empirical CDFs.
  • Figure 4: Left: Pearson correlation of the reward mean and dopamine level. Right: W-1 loss between learned distribution and dopamine level.
  • Figure 5: Return distributions comparison in HalfCheetah.
  • ...and 7 more figures
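The two metrics named in the Figure 4 caption can be sketched as follows. This is an illustrative computation under stated assumptions, not the authors' evaluation code: the arrays `learned` and `recorded` are made-up placeholders rather than data from the paper, and `wasserstein1` assumes two equal-size samples, for which the W-1 distance reduces to the mean absolute difference of the sorted values.

```python
import numpy as np

def wasserstein1(x, y):
    """W-1 distance between two equal-size empirical samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    # For equal-size samples, the optimal coupling matches sorted values.
    return float(np.mean(np.abs(x - y)))

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical placeholder values, not data from the paper.
learned = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
recorded = 2.0 * learned + 0.5  # an exact affine function of `learned`

correlation = pearson(learned, recorded)
distance = wasserstein1(learned, learned + 0.5)
```

Pearson correlation ignores affine rescaling, whereas W-1 is sensitive to shifts in location and scale, which is one reason the two metrics are complementary.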

Theorems & Definitions (10)

  • Definition 4.1: First-Order Stochastic Dominance (FSD) (Hadar and Russell, 1969)
  • Proposition 4.2: Theorems 1-2 of Hadar and Russell (1969)
  • Corollary 4.3: Mean Dominance
  • Definition 4.4: Distortion function
  • Definition 4.5: Distortion Risk Measure (DRM) (Dhaene et al., 2012)
  • Proposition 4.5
  • Proposition B.1
  • proof
  • Proposition B.1
  • proof
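As a rough illustration of the distortion function and distortion risk measure named in the definitions above, the sketch below evaluates a DRM on an empirical sample by reweighting its sorted values. The paper's exact definition and sign convention are not shown on this page, so the quantile-weighting form used here, the lower-tail CVaR distortion, and the names `distortion_risk_measure`, `identity`, and `cvar_distortion` are all assumptions.

```python
import numpy as np

def distortion_risk_measure(samples, distortion):
    """Empirical DRM: sum_i x_(i) * (g(i/n) - g((i-1)/n)).

    `distortion` is a nondecreasing map g on [0, 1] with g(0) = 0 and
    g(1) = 1; it reweights the sorted samples (the empirical quantiles).
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    grid = np.arange(n + 1) / n
    weights = np.diff(distortion(grid))
    return float(np.dot(weights, x))

def identity(tau):
    """Risk-neutral distortion: recovers the sample mean."""
    return tau

def cvar_distortion(alpha):
    """Distortion putting all weight on the lowest alpha-fraction of outcomes."""
    def g(tau):
        return np.minimum(tau / alpha, 1.0)
    return g

# Hypothetical returns, chosen only to make the arithmetic easy to follow.
returns = np.array([-2.0, 0.0, 1.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 8.0])

mean_value = distortion_risk_measure(returns, identity)
cvar_20 = distortion_risk_measure(returns, cvar_distortion(0.2))
```

With the identity distortion the measure equals the sample mean, and with the CVaR distortion at level 0.2 it averages the two lowest of the ten returns, so a risk-averse objective ranks this sample far lower than its mean suggests.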