Fast Lifelong Adaptive Inverse Reinforcement Learning from Demonstrations

Letian Chen, Sravan Jayanthi, Rohan Paleja, Daniel Martin, Viacheslav Zakharov, Matthew Gombolay

TL;DR

FLAIR tackles fast, lifelong learning from heterogeneous demonstrations by decomposing tasks into a compact set of prototypical strategies and composing them as policy mixtures. Built on an AIRL backbone with an MSRD-inspired reward decomposition and a novel Between-Class Discrimination objective, it distills shared knowledge while preserving user-specific preferences. The framework automatically decides when to reuse existing prototypes and when to create new ones, enabling sample-efficient adaptation and sublinear growth in the number of strategies as demonstrations accumulate, with strong results across OpenAI Gym tasks and a real-world table tennis case study. The approach offers scalable, personalized LfD suitable for deployment in ubiquitous robots; its stated limitations concern initial demonstration diversity and reward non-stationarity.

Abstract

Learning from Demonstration (LfD) approaches empower end-users to teach robots novel tasks via demonstrations of the desired behaviors, democratizing access to robotics. However, current LfD frameworks are capable of neither fast adaptation to heterogeneous human demonstrations nor large-scale deployment in ubiquitous robotics applications. In this paper, we propose a novel LfD framework, Fast Lifelong Adaptive Inverse Reinforcement Learning (FLAIR). Our approach (1) leverages learned strategies to construct policy mixtures for fast adaptation to new demonstrations, allowing for quick end-user personalization; (2) distills common knowledge across demonstrations, achieving accurate task inference; and (3) expands its model only when needed in lifelong deployments, maintaining a concise set of prototypical strategies that can approximate all behaviors via policy mixtures. We empirically validate that FLAIR achieves adaptability (i.e., the robot adapts to heterogeneous, user-specific task preferences), efficiency (i.e., the robot achieves sample-efficient adaptation), and scalability (i.e., the model grows sublinearly with the number of demonstrations while maintaining high performance). FLAIR surpasses benchmarks across three control tasks with an average 57% improvement in policy returns and an average 78% fewer episodes required for demonstration modeling using policy mixtures. Finally, we demonstrate the success of FLAIR in a table tennis task and find that users rate FLAIR as having higher task (p<.05) and personalization (p<.05) performance.
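
The reuse-or-expand behavior described in points (1) and (3) can be illustrated with a toy sketch. It is not the paper's implementation. It assumes each strategy and each demonstration is summarized by a fixed-length feature vector, that closeness is measured by Euclidean distance, and that a hand-picked `threshold` decides novelty. The function names (`project_to_simplex`, `fit_mixture`, `incorporate`) are hypothetical, and the actual FLAIR training of new strategies via inverse reinforcement learning is replaced here by simply storing the demonstration's features as a new prototype.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1) / ks > 0)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0.0)


def fit_mixture(prototypes, demo, steps=500):
    """Fit mixture weights w (w >= 0, sum(w) == 1) by projected gradient
    descent on ||P^T w - demo||^2. Returns the weights and the residual."""
    P = np.asarray(prototypes, dtype=float)   # shape (k, d)
    demo = np.asarray(demo, dtype=float)      # shape (d,)
    w = np.full(P.shape[0], 1.0 / P.shape[0])
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    lipschitz = 2.0 * np.linalg.norm(P @ P.T, 2)
    step = 1.0 / max(lipschitz, 1e-12)
    for _ in range(steps):
        grad = 2.0 * P @ (P.T @ w - demo)
        w = project_to_simplex(w - step * grad)
    residual = float(np.linalg.norm(P.T @ w - demo))
    return w, residual


def incorporate(prototypes, demo, threshold):
    """Reuse a mixture of existing prototypes if it is close enough to the
    demonstration; otherwise add a new prototype (a stand-in for training
    a new strategy). Mutates `prototypes` in the second case."""
    if prototypes:
        w, residual = fit_mixture(prototypes, demo)
        if residual <= threshold:
            return "mixture", w
    prototypes.append(np.asarray(demo, dtype=float))
    return "new_strategy", None
```

Starting from an empty prototype list, demonstrations with features `[1, 0]` and `[0, 1]` each create a new prototype, while a later demonstration `[0.3, 0.7]` is explained by a mixture of the two and leaves the model size unchanged. The sketch keeps the prototype set small in the same spirit as the sublinear growth the abstract describes, though the distance measure and threshold here are illustrative choices only.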

Paper Structure

This paper contains 14 sections, 1 theorem, 5 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

Under the maximum entropy principle,

Figures (7)

  • Figure 1: This figure shows an illustration of the lifelong learning process with our proposed method, FLAIR. As each demonstrator performs their strike, FLAIR determines whether the demonstration is novel. If a demonstration can be explained by a policy mixture of previously learned strategies, FLAIR accepts the policy mixture without training a new strategy. If the policy mixture is not close to the demonstration, FLAIR creates a new strategy and a prototype policy for the demonstration.
  • Figure 2: This figure shows the correlation between the estimated task reward and the ground-truth task reward for Inverted Pendulum. Each dot is a trajectory. FLAIR achieves a higher task reward correlation.
  • Figure 3: This figure compares the number of episodes needed for AIRL and MSRD to achieve the same log likelihood as FLAIR's mixture optimization. The red bar is the median and the red triangle represents the mean.
  • Figure 4: This figure depicts the normalized strategy rewards on demonstrations in Inverted Pendulum (IP) for FLAIR without BCD (left) and with BCD (right).
  • Figure 5: This figure plots the returns of FLAIR policies in a 100-demonstration experiment in Inverted Pendulum.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Lemma 1