Learning from Ambiguous Demonstrations with Self-Explanation Guided Reinforcement Learning

Yantian Zha, Lin Guan, Subbarao Kambhampati

TL;DR

Ambiguity in human demonstrations can destabilize reinforcement learning from demonstrations (RLfD). The authors introduce SERLfD, a framework that learns self-explanations of predicate utilities via a Self-Explanation Network (SE-Net) and uses them to augment states or reward signals within a GAN-IRL–inspired training loop. The approach jointly trains a generator agent and the SE-Net, grounding predicates in a predefined vocabulary and leveraging success/failure buffers to identify task-relevant relations. Empirical results across continuous robotic domains and a discrete Pacman task show improved training stability and performance over strong RLfD baselines and GAN-IRL variants, demonstrating SERLfD’s effectiveness in mitigating ambiguity in demonstrations.
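The two uses of self-explanations mentioned above (augmenting states and augmenting rewards) can be illustrated with a minimal sketch. This summary does not give the paper's exact formulation, so everything here is an assumption made for illustration: predicate values `p` are taken to be numbers (for example 0 or 1), the SE-Net output is taken to be one utility `u` per predicate, and the function names `augment_state` and `shaped_reward` and the `scale` parameter are hypothetical.

```python
from typing import List, Sequence


def augment_state(
    state: Sequence[float],
    predicates: Sequence[float],
    utilities: Sequence[float],
) -> List[float]:
    """Append utility-weighted predicate values to the raw state.

    Assumption: the augmentation is a simple concatenation of the raw
    state with u_i * p_i for each predicate i.
    """
    if len(predicates) != len(utilities):
        raise ValueError("need exactly one utility per predicate")
    weighted = [u * p for u, p in zip(utilities, predicates)]
    return list(state) + weighted


def shaped_reward(
    env_reward: float,
    predicates: Sequence[float],
    utilities: Sequence[float],
    scale: float = 1.0,
) -> float:
    """Add a bonus equal to the scaled sum of u_i * p_i to the reward.

    Assumption: the reward augmentation is additive; `scale` is a
    hypothetical weighting hyperparameter.
    """
    if len(predicates) != len(utilities):
        raise ValueError("need exactly one utility per predicate")
    bonus = sum(u * p for u, p in zip(utilities, predicates))
    return env_reward + scale * bonus


if __name__ == "__main__":
    state = [0.5, -1.0]            # raw environment observation
    predicates = [1.0, 0.0, 1.0]   # grounded predicate values (p)
    utilities = [0.8, 0.3, -0.2]   # stand-in for SE-Net outputs (u)
    print(augment_state(state, predicates, utilities))
    print(shaped_reward(1.0, predicates, utilities, scale=0.5))
```

In this sketch a predicate that does not hold (value 0) contributes nothing to either the augmented state or the reward bonus, so only relations that are both true and judged useful affect the agent.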

Abstract

Our work aims at efficiently leveraging ambiguous demonstrations for the training of a reinforcement learning (RL) agent. An ambiguous demonstration can usually be interpreted in multiple ways, which severely hinders the RL agent from learning stably and efficiently. Because even an optimal demonstration can be ambiguous, previous works that combine RL and learning from demonstration (RLfD works) may not work well. Inspired by how humans handle such situations, we propose to use self-explanation (an agent generates explanations for itself) to recognize valuable high-level relational features as an interpretation of why a successful trajectory is successful. In this way, the agent can provide some guidance for its own RL training. Our main contribution is the Self-Explanation for RL from Demonstrations (SERLfD) framework, which can overcome the limitations of traditional RLfD works. Our experimental results show that the SERLfD framework improves an RLfD model in terms of training stability and performance.

Paper Structure

This paper contains 24 sections, 6 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Fig. 1.1 shows the Robot-Push domain. There are two target regions, indexed L1 and L2, which are randomly assigned the colors yellow and blue, or blue and yellow, in each episode. A human user demonstrates the task of pushing the ring into the blue region and the block into the yellow region. Fig. 1.2 shows a three-step robot execution with grounded predicates (p) and predicted self-explanations (u).
  • Figure 2: The SERLfD framework, which couples the learning of Self-Explanation Networks (SE-Nets) with an RL agent. Roboticists provide predicates as human-related background domain knowledge to help robots disambiguate non-expert demonstrations for specific tasks. The buffer $\mathcal{D}_{success}$ stores successful experiences, including those from demonstrations, and $\mathcal{D}_{failure}$ stores unsuccessful experiences. The SE-Net is trained by inserting it into a discriminator that distinguishes between successful and failed experiences in a generative-adversarial framework.
  • Figure 3: Learning curves for the baseline RLfD agents (TD3fD/SACfD); TD3fD/SACfD with SE-Nets that use self-explanations to augment states (TD3fD/SACfD+SE+nrs), augment rewards (TD3fD/SACfD+SE+nu), or both (TD3fD/SACfD+SE); and an imitation learning agent built by using RL in the original SA-GAN-GCL framework (Fu et al., 2017). The blue, red, aqua, magenta, and gold curves are the results of baseline TD3fD/SACfD, TD3fD/SACfD+SE, TD3fD/SACfD+SE+nrs, TD3fD/SACfD+SE+nu, and SA-GAN-GCL, respectively. Each algorithm is run three times, and we report the mean and standard-variance, plotted as the bold line and the lighter shaded region, respectively. Each y-axis value is a score averaged over 100 episodes; the x-axis shows episodes.
  • Figure 4: Visualizing the predicted self-explanations from agents TD3fD+SE and SACfD+SE in the Robot-Push-Simple domain.
  • Figure 5: Examples of predicted self-explanations from agents TD3fD+SE and SACfD+SE in the Robot-Push domain.
  • ...and 4 more figures