Overcoming Knowledge Barriers: Online Imitation Learning from Visual Observation with Pretrained World Models

Xingyuan Zhang, Philip Becker-Ehmck, Patrick van der Smagt, Maximilian Karl

TL;DR

This paper identifies two key barriers in Imitation Learning from Observation (ILfO) with pretrained world models: the Embodiment Knowledge Barrier (EKB) and the Demonstration Knowledge Barrier (DKB). It introduces AIME-NoB, which combines online interaction with a data-driven regulariser to mitigate the EKB and uses a surrogate reward to enlarge the policy's state coverage and address the DKB, implemented via a Dreamer-style latent actor-critic. Empirical results on vision-based control benchmarks (DeepMind Control Suite and MetaWorld) show that AIME-NoB substantially improves sample efficiency and final performance over state-of-the-art baselines, with the AIL surrogate variant performing best. The work demonstrates the practical potential of pretrained world models for ILfO and outlines avenues for scheduling, scaling up models, and multi-task extensions.

Abstract

Pretraining and finetuning models has become increasingly popular in decision-making. But there are still serious impediments in Imitation Learning from Observation (ILfO) with pretrained models. This study identifies two primary obstacles: the Embodiment Knowledge Barrier (EKB) and the Demonstration Knowledge Barrier (DKB). The EKB emerges due to the pretrained models' limitations in handling novel observations, which leads to inaccurate action inference. Conversely, the DKB stems from the reliance on limited demonstration datasets, restricting the model's adaptability across diverse scenarios. We propose separate solutions to overcome each barrier and apply them to Action Inference by Maximising Evidence (AIME), a state-of-the-art algorithm. This new algorithm, AIME-NoB, integrates online interactions and a data-driven regulariser to mitigate the EKB. Additionally, it uses a surrogate reward function to broaden the policy's supported states, addressing the DKB. Our experiments on vision-based control tasks from the DeepMind Control Suite and MetaWorld benchmarks show that AIME-NoB significantly improves sample efficiency and converged performance, presenting a robust framework for overcoming the challenges in ILfO with pretrained models. Code available at https://github.com/IcarusWizard/AIME-NoB.

Paper Structure

This paper contains 24 sections, 20 equations, 20 figures, 3 tables, and 1 algorithm.

Figures (20)

  • Figure 1: Main idea of this paper. On the left, we plot the performance of BCO(0) and AIME together with their oracle versions, which remove the EKB, w.r.t. different numbers of demonstrations on the walker-run task. For each setting with the same number of demonstrations, i.e. each column, the value difference between the oracle version and the expert is the Demonstration Knowledge Barrier (DKB), while the value difference between the algorithm and its oracle version represents the Embodiment Knowledge Barrier (EKB). On the right, we present the solutions proposed in this paper to overcome the two barriers. The blue parts represent the original version of the algorithms that suffer from the knowledge barriers. Orange parts demonstrate the solution for the EKB, where the agent is allowed to interact with the environment and use $D_{\mathrm{online}}$ together with $D_{\mathrm{body}}$ to update the world model. Purple parts show the solution for the DKB, where a surrogate reward model is derived from $D_{\mathrm{demo}}$, used to label $D_{\mathrm{online}}$, and then used as an RL signal for policy learning.
  • Figure 2: Comparing AIME-NoB with other algorithms. The figures show aggregated IQM scores on 9 DMC tasks and 6 MetaWorld tasks. All the algorithms are evaluated with 5 seeds on each task, with the shaded region representing the 95% CI.
  • Figure 3: Performance of AIME-NoB, AIME-NoEKB, MBBC, and AIME w.r.t. different numbers of demonstrations on the walker-run task. For AIME-NoB, we do not show results for more than 20 demonstrations, since its performance has already saturated at the expert level. All results are averaged across 5 seeds, with the shaded region representing the 95% CI.
  • Figure 4: Ablations of different variants of AIME-NoB and choices of the embodiment datasets. All the algorithms are evaluated with 5 seeds, with the shaded region representing the 95% CI.
  • Figure 5: Ablations of replay ratio $\alpha$ on the walker-run task. AIME-NoB is run with 10 demonstrations, while AIME-NoEKB is run with 100 demonstrations. Action MSE is only shown for the first $10^5$ env steps. All results are averaged across 5 seeds, with the shaded region representing the 95% CI.
  • ...and 15 more figures
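
The captions above rely on two simple computations: the Figure 1 decomposition of the gap to the expert into DKB and EKB, and the interquartile mean (IQM) used for aggregation in Figure 2. The following is a minimal sketch of both. The function names and all scores are hypothetical. Pooling normalised scores across tasks and seeds before trimming is an assumption; this page does not describe the authors' exact aggregation procedure.

```python
import numpy as np


def knowledge_barriers(expert_score, oracle_score, algorithm_score):
    """Split the gap to the expert as described in the Figure 1 caption.

    DKB: expert minus oracle (gap that remains once the EKB is removed).
    EKB: oracle minus algorithm (gap caused by the embodiment barrier).
    """
    dkb = expert_score - oracle_score
    ekb = oracle_score - algorithm_score
    return dkb, ekb


def interquartile_mean(scores):
    """Mean of the middle 50% of scores: drop the lowest and highest quarter.

    Assumption: scores of any shape (e.g. tasks x seeds) are pooled into
    one flat array before trimming.
    """
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of runs trimmed from each end
    return float(flat[cut:flat.size - cut].mean())


# Hypothetical normalised returns, chosen only for illustration.
dkb, ekb = knowledge_barriers(expert_score=1.0, oracle_score=0.75, algorithm_score=0.25)
print(f"DKB={dkb}, EKB={ekb}")

# One outlier run (100) has no effect on the IQM, unlike the plain mean.
print(interquartile_mean([0, 1, 2, 3, 4, 5, 6, 100]))
```

Under this decomposition the two barriers sum to the total gap between the expert and the algorithm, which is why a column in Figure 1 can be read as two stacked differences.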