Zero-shot Imitation Policy via Search in Demonstration Dataset

Federico Malato; Florian Leopold; Andrew Melnik; Ville Hautamäki

TL;DR

This work introduces Zero-shot Imitation Policy (ZIP), a search-based imitation framework that avoids extensive training by indexing a dataset of expert demonstrations in the latent space of a pretrained Video Pre-Training (VPT) model. At test time, ZIP retrieves the most similar past situation using the $L_1$ distance between embeddings and imitates its actions, switching to a new reference when a latent-divergence threshold or a time threshold is reached. In the MineRL BASALT FindCave experiments, ZIP receives strong perceptual evaluations and achieves the highest quantitative success rate among the evaluated agents, while requiring significantly less training time than traditional imitation-learning baselines. The results demonstrate effective zero-shot adaptation in discrete-action environments and point to future improvements via scalable latent-space indexing and relevance-aware data ranking.

Abstract

Behavioral cloning uses a dataset of demonstrations to learn a policy. To overcome computationally expensive training procedures and address the policy adaptation problem, we propose to use the latent spaces of pre-trained foundation models to index a demonstration dataset, instantly access similar relevant experiences, and copy behavior from these situations. The agent performs the actions of a selected similar situation until the representations of its current situation and of the selected experience diverge in the latent space. Thus, we formulate our control problem as a dynamic search problem over a dataset of experts' demonstrations. We test our approach on the BASALT MineRL dataset, in the latent representation of a Video Pre-Training model, and compare it to state-of-the-art Minecraft agents based on imitation learning. Our approach effectively recovers meaningful demonstrations and shows human-like behavior in the Minecraft environment across a wide variety of scenarios. Experimental results show that our search-based approach clearly outperforms learning-based models in terms of accuracy and perceptual evaluation.
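The dynamic search described in the abstract can be illustrated with a minimal sketch. It assumes the demonstration frames have already been encoded into a flat array of embeddings with one action per frame; the class name `SearchPolicy`, the helper `nearest_reference`, and the hyperparameters `divergence_threshold` and `max_follow_steps` are hypothetical names chosen for illustration, not the authors' implementation.

```python
import numpy as np


def nearest_reference(query, index_embeddings):
    """Return (position, L1 distance) of the indexed embedding closest to `query`."""
    distances = np.abs(index_embeddings - query).sum(axis=1)
    best = int(np.argmin(distances))
    return best, float(distances[best])


class SearchPolicy:
    """Copy actions from a retrieved demonstration until a new search is triggered.

    Assumption: `embeddings[i]` and `actions[i]` describe consecutive frames of a
    single demonstration stream; trajectory boundaries are ignored in this sketch.
    """

    def __init__(self, embeddings, actions, divergence_threshold, max_follow_steps):
        self.embeddings = np.asarray(embeddings, dtype=np.float64)
        self.actions = list(actions)
        self.divergence_threshold = divergence_threshold  # assumed hyperparameter
        self.max_follow_steps = max_follow_steps          # assumed hyperparameter
        self.ref = None      # position of the current reference frame
        self.followed = 0    # steps taken since the last search
        self.searches = 0    # number of searches performed so far

    def act(self, current_embedding):
        z = np.asarray(current_embedding, dtype=np.float64)
        need_search = self.ref is None
        if not need_search:
            gap = np.abs(self.embeddings[self.ref] - z).sum()
            diverged = gap > self.divergence_threshold          # divergence-based trigger
            timed_out = self.followed >= self.max_follow_steps  # time-based trigger
            need_search = diverged or timed_out
        if need_search:
            self.ref, _ = nearest_reference(z, self.embeddings)
            self.followed = 0
            self.searches += 1
        action = self.actions[self.ref]
        # Move to the next demonstration frame; force a new search at the end.
        nxt = self.ref + 1
        self.ref = nxt if nxt < len(self.actions) else None
        self.followed += 1
        return action
```

In an evaluation loop, `act` would be called once per time-step with the embedding of the newest observation, and the returned action would be sent to the environment.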

Paper Structure (11 sections, 7 figures, 1 table)

Introduction
Methods
Zero-shot Imitation Policy
Experiments
Results
Perceptual evaluation
Quantitative results
Ablation study
Latent space visualisation
Conclusions
Acknowledgements

Figures (7)

Figure 1: A scheme of the VPT model used in this study. An image input is encoded with an IMPALA CNN and passed through four transformer heads. Then, two MLP heads predict a keyboard and a mouse action respectively.
Figure 2: Our approach. (A) Latent space generation: trajectories are extracted from the demonstration dataset. Frames are encoded through a provided VPT model and paired with the corresponding actions. (B) Evaluation loop: at each time-step, the new observation is forwarded to the same VPT model. Then, the $L_1$ distance between the current and reference embeddings is computed and the most similar situation is found. ZIP acts in the environment following the actions of the selected reference situation.
Figure 3: An example of the search mechanism. At each time-step, we keep track of the distance between the current and reference embeddings. Whenever the distance exceeds a threshold, a divergence-based search (red line) selects a new reference embedding; if the agent follows the same reference for too long, a time-based search (blue line) is triggered. For each segment of the episode, a yellow dashed line indicates the value of the reference distance. A brown diamond on each red line shows the distance value that triggered the new search.
Figure 4: Time needed to train each agent on 100 trajectories on the FindCave task. In the case of BC models, the training procedure consists of fine-tuning a pre-trained VPT model. For ZIP, training means encoding a subset of trajectories through the reference version of VPT. All models have been trained on a single Tesla T4 GPU.
Figure 5: Average success rate for the tested models on the FindCave task. Each agent has been evaluated on a batch of 20 seeds, and each run has been repeated three times. The baseline model is highlighted with a vertical blue line.
...and 2 more figures
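
The latent space generation step in panel (A) of Figure 2 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: `build_index` is a hypothetical helper, each trajectory is assumed to be a list of `(frame, action)` pairs, and `toy_encoder` is a stand-in for the VPT encoder, which is not called here.

```python
import numpy as np


def build_index(trajectories, encode):
    """Encode every frame and pair it with its action.

    `trajectories` is assumed to be a list of trajectories, each a list of
    (frame, action) pairs. `encode` is any callable mapping a frame to a
    vector; in the paper this role is played by a pretrained VPT model.
    """
    embeddings, actions, trajectory_ids = [], [], []
    for trajectory_id, trajectory in enumerate(trajectories):
        for frame, action in trajectory:
            embeddings.append(np.asarray(encode(frame), dtype=np.float64).ravel())
            actions.append(action)
            trajectory_ids.append(trajectory_id)  # remember where each frame came from
    return np.stack(embeddings), actions, np.asarray(trajectory_ids)


def toy_encoder(frame):
    """Stand-in encoder (assumption): mean value of each colour channel."""
    return np.asarray(frame, dtype=np.float64).mean(axis=(0, 1))
```

Encoding happens once, before evaluation; at test time only the newest observation needs to pass through the encoder.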
