Inverse Reinforcement Learning by Estimating Expertise of Demonstrators

Mark Beliaev, Ramtin Pedarsani

TL;DR

This work tackles learning from suboptimal and heterogeneous demonstrations in imitation learning by introducing IRLEED, a framework that combines a Boltzmann-based suboptimality model for demonstrators with a Maximum Entropy IRL objective. By jointly estimating a ground-truth reward parameter $\theta$, per-demonstrator biases $\epsilon_i$, and precision $\beta_i$, IRLEED accounts for reward bias and action variance across sources, enabling dynamics-aware recovery of the optimal policy. The approach generalizes standard IRL and ILEED, provides gradient-based update rules, and supports practical neural-network-based extensions with regularization. Empirical results across online and offline tasks, including simulations and human data, show IRLEED achieving higher returns and better reward recovery than baselines, particularly when learning from mixed-quality demonstrations, which has strong implications for crowd-sourced and cross-source IL settings.
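
As a rough illustration of the Boltzmann-based suboptimality model described above, the sketch below computes a demonstrator's action distribution in a one-step problem. It is not the paper's implementation. The linear reward form, the additive use of $\epsilon_i$ on $\theta$, the one-step simplification (no dynamics), and the names `boltzmann_policy` and `features` are all assumptions made for illustration.

```python
import math

def boltzmann_policy(theta, eps_i, beta_i, features):
    """Action probabilities for one demonstrator in a one-step problem.

    Assumptions (not taken from the paper's text): the reward is linear,
    r(a) = (theta + eps_i) . features[a], and the demonstrator picks
    actions with probability proportional to exp(beta_i * r(a)).
    """
    biased_theta = [t + e for t, e in zip(theta, eps_i)]
    scores = [
        beta_i * sum(w * f for w, f in zip(biased_theta, phi))
        for phi in features
    ]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [x / norm for x in exps]

# Hypothetical setup: three actions described by two reward features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [1.0, 0.5]  # stand-in for the ground-truth reward parameter

# No bias, high precision: mass concentrates on the best action.
expert = boltzmann_policy(theta, [0.0, 0.0], 5.0, features)
# No bias, zero precision: actions are chosen uniformly at random.
noisy = boltzmann_policy(theta, [0.0, 0.0], 0.0, features)
# Biased reward, high precision: confidently prefers a worse action.
biased = boltzmann_policy(theta, [-2.0, 0.0], 5.0, features)
```

In this toy setting, the zero-precision demonstrator is uninformative rather than misleading, while the biased demonstrator is consistently wrong. Those are the two failure modes (action variance and reward bias) that the summary says IRLEED models separately.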

Abstract

In Imitation Learning (IL), utilizing suboptimal and heterogeneous demonstrations presents a substantial challenge due to the varied nature of real-world data. However, standard IL algorithms consider these datasets as homogeneous, thereby inheriting the deficiencies of suboptimal demonstrators. Previous approaches to this issue rely on impractical assumptions like high-quality data subsets, confidence rankings, or explicit environmental knowledge. This paper introduces IRLEED, Inverse Reinforcement Learning by Estimating Expertise of Demonstrators, a novel framework that overcomes these hurdles without prior knowledge of demonstrator expertise. IRLEED enhances existing Inverse Reinforcement Learning (IRL) algorithms by combining a general model for demonstrator suboptimality to address reward bias and action variance, with a Maximum Entropy IRL framework to efficiently derive the optimal policy from diverse, suboptimal demonstrations. Experiments in both online and offline IL settings, with simulated and human-generated data, demonstrate IRLEED's adaptability and effectiveness, making it a versatile solution for learning from suboptimal demonstrations.

Paper Structure

This paper contains 16 sections, 3 theorems, 10 equations, 4 figures, 2 tables.

Key Result

Proposition 4.1

$\theta^\star$ is the (non-unique) maximizer of $\mathcal{L}(\theta)$.

Figures (4)

  • Figure 1: IRLEED is applied in the suboptimal setting to estimate the ground truth reward $r$, which is used to find the optimal policy. Left: A heterogeneous dataset is collected from multiple sources with varying optimality. We categorize this optimality by using accuracy to represent the reward bias, and precision to represent the variance in action choices. Right: We infer the demonstrator policies using a model for behavior based on the Boltzmann rationality principle, which captures both the accuracy $\epsilon_i$, and the precision $\beta_i$, as compared to the ground truth reward $r$. Middle: Using estimates of the demonstrator policies $\hat{\pi}_i$ along with the demonstrations $\mathcal{D}_i$, we can optimize for the true reward $r$, and the parameters that capture accuracy $\epsilon$ and precision $\beta$.
  • Figure 2: We visualize the reward recovered by IRLEED and IRL when trained using suboptimal demonstrations. Top left shows the true reward, where the three yellow corners are terminal states. Top right shows the normalized state visitation frequency over the entire dataset. Bottom left and right show the normalized rewards recovered by IRLEED and IRL, respectively. We can see that the provided demonstrations are misaligned with the ground truth reward: the state visitation frequency for the top left corner is higher due to demonstrator suboptimalities. As expected, the feature matching constraint of IRL absorbs this suboptimality. Although the reward recovered by IRL contains information about the true reward, it is incorrectly biased toward the top left corner. On the other hand, IRLEED is able to remove this bias, providing a better estimate of the ground truth reward.
  • Figure 3: We compare the performance of the policies recovered by IRLEED and IRL. The left plot shows the relative performance of IRLEED over IRL under varying dataset settings, where the top right corner corresponds to expert data. The right plot shows the performance of both policies as we increase the accuracy $\epsilon$ of demonstrators in the dataset, corresponding to the data outlined in red in the left plot. On average, IRLEED provided a $30.3\%$ improvement over IRL.
  • Figure 4: We plot the mean episode return of the policies learned by IRLEED (solid) and IQ (dashed) as they train on data from a single source (right) vs. a mixture of sources (left).
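
The normalized state visitation frequency mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming that "normalized" means the visit counts are divided by the total number of visits in the dataset (the caption does not specify the normalization). The function name and the toy trajectories are hypothetical.

```python
from collections import Counter

def state_visitation_frequency(trajectories):
    """Fraction of all visits in the dataset that land on each state.

    Assumption: each trajectory is a sequence of hashable states, and
    normalization divides by the total visit count so values sum to 1.
    """
    counts = Counter(state for traj in trajectories for state in traj)
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}

# Hypothetical dataset: two short trajectories on a grid, states as (row, col).
trajectories = [
    [(0, 0), (0, 1)],
    [(0, 0), (1, 0)],
]
freq = state_visitation_frequency(trajectories)
```

Here `freq[(0, 0)]` is 0.5 because that cell accounts for two of the four recorded visits. A state that suboptimal demonstrators over-visit shows up with a frequency that is too high in the same way, which is the misalignment the caption describes.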

Theorems & Definitions (7)

  • Remark 3.1
  • Proposition 4.1
  • Lemma 4.2
  • Proof
  • Remark 4.3
  • Proposition 4.4
  • Remark 4.5