A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search

Arnav Kumar Jain, Vibhakar Mohta, Subin Kim, Atiksh Bhardwaj, Juntao Ren, Yunhai Feng, Sanjiban Choudhury, Gokul Swamy

TL;DR

This work tackles the brittleness of behavioral cloning (BC) in imitation learning: a BC agent that drifts away from the demonstrations has no way to recover. SAILOR addresses this through learning to search. It learns a world model and a reward model from expert demonstrations and base-policy data, then uses MPPI at test time to plan residual actions that correct the base policy's nominal plan inside the latent world model. Across twelve long-horizon visual manipulation tasks, SAILOR consistently surpasses diffusion-policy-based BC trained on the same demonstrations. It is also data-efficient, robust to reward hacking, and able to recover from mistakes without additional human feedback. The results suggest that combining learning with search yields more robust imitation than BC alone.

Abstract

The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. As a result, when a BC agent makes a mistake that takes it out of the support of the demonstrations, it often does not know how to recover. In this sense, BC is akin to giving the agent the fish (dense supervision across a narrow set of states) rather than teaching it to fish: to reason independently about achieving the expert's outcome even when faced with unseen situations at test time. In response, we explore learning to search (L2S) from expert demonstrations, i.e., learning the components required to plan, at test time, to match expert outcomes even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach SAILOR consistently outperforms state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the number of demonstrations used for BC by 5-10x still leaves a performance gap. We find that SAILOR can identify nuanced failures and is robust to reward hacking. Our code is available at https://github.com/arnavkj1995/SAILOR .

Paper Structure

This paper contains 18 sections, 6 equations, 13 figures, 3 tables, 2 algorithms.

Figures (13)

  • Figure 1: We introduce SAILOR, a method for learning to search from expert demonstrations. By learning world and reward models on a mixture of expert and base-policy data, we endow the agent with the ability to, at test time, reason about how to recover from mistakes that the base policy makes.
  • Figure 2: Left: SAILOR consistently outperforms diffusion policies trained on the same demos across various visual manipulation problems at multiple dataset scales $|\mathcal{D}|$. Right: SAILOR's learned reward model is able to detect shared prefixes (black dots and frames), base policy failures (purple dots, lines, and frames), and SAILOR's successes (orange dots, lines, and frames).
  • Figure 3: At inference time, SAILOR performs a search for residual plans to correct mistakes in the base policy's nominal plan in the latent world model WM against the learned reward model RM and critic V. It then executes the first step of the best corrected plan before re-planning, MPC-style.
  • Figure 4: Across 12 visual manipulation problems from 3 benchmarks, SAILOR consistently outperforms diffusion policy (DP) trained on the same demos, where $|\mathcal{D}|$ denotes the number of demos.
  • Figure 5: Simply scaling up the number of demos $|\mathcal{D}|$ used to train DP via behavioral cloning by 5-10$\times$ often leads to a performance plateau and does not match the performance of SAILOR.
  • ...and 8 more figures
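
The inference-time procedure summarized in the Figure 3 caption (search for a residual plan on top of the base policy's nominal plan, score candidates with the world model, reward model, and critic, then execute only the first step before re-planning) can be sketched roughly as follows. This is a minimal NumPy illustration of MPPI-style residual planning, not the authors' implementation. The function `mppi_residual_plan`, its hyperparameters, the absence of discounting, and the toy point-mass `toy_world_model`, `toy_reward_model`, and `toy_critic` are all assumptions made for illustration; in SAILOR these components are learned networks operating on latent states.

```python
import numpy as np


def mppi_residual_plan(z0, nominal_plan, world_model, reward_model, critic,
                       num_samples=256, num_iters=4, temperature=1.0,
                       init_std=0.5, rng=None):
    """Return nominal_plan plus an MPPI-estimated residual (hypothetical helper).

    z0:           latent state, shape (latent_dim,)
    nominal_plan: base-policy actions, shape (horizon, action_dim)
    world_model:  (z, a) -> next z, batched over the first axis
    reward_model: (z, a) -> reward, shape (batch,)
    critic:       z -> terminal value, shape (batch,)
    """
    rng = np.random.default_rng() if rng is None else rng
    horizon, action_dim = nominal_plan.shape
    mean = np.zeros((horizon, action_dim))          # start from "no correction"
    std = np.full((horizon, action_dim), init_std)

    for _ in range(num_iters):
        # Sample candidate residual plans around the current mean.
        noise = rng.standard_normal((num_samples, horizon, action_dim))
        residuals = mean + std * noise

        # Roll each corrected plan out in the (latent) world model.
        z = np.repeat(z0[None, :], num_samples, axis=0)
        returns = np.zeros(num_samples)
        for t in range(horizon):
            actions = nominal_plan[t] + residuals[:, t]
            returns += reward_model(z, actions)
            z = world_model(z, actions)
        returns += critic(z)                        # bootstrap past the horizon

        # MPPI update: softmax-weighted mean and spread of the samples.
        weights = np.exp((returns - returns.max()) / temperature)
        weights /= weights.sum()
        mean = np.einsum("n,nha->ha", weights, residuals)
        var = np.einsum("n,nha->ha", weights, (residuals - mean) ** 2)
        std = np.sqrt(var) + 1e-6

    return nominal_plan + mean


# Toy stand-ins (assumptions): a 2-D point mass that should reach `goal`.
goal = np.array([1.0, 1.0])

def toy_world_model(z, a):
    return z + a

def toy_reward_model(z, a):
    return -np.sum((z - goal) ** 2, axis=-1)

def toy_critic(z):
    return -np.sum((z - goal) ** 2, axis=-1)

def rollout_return(plan, z0):
    """Score a single plan under the toy models."""
    z, total = z0[None, :], 0.0
    for a in plan:
        total += toy_reward_model(z, a[None, :])[0]
        z = toy_world_model(z, a[None, :])
    return total + toy_critic(z)[0]


z0 = np.zeros(2)
nominal = np.zeros((5, 2))   # a base policy that never moves toward the goal
corrected = mppi_residual_plan(z0, nominal, toy_world_model, toy_reward_model,
                               toy_critic, rng=np.random.default_rng(0))
first_action = corrected[0]  # MPC-style: execute this step, then re-plan
```

Note that this sketch returns the weighted-mean plan, which is the usual MPPI estimate, whereas the caption speaks of the best corrected plan; selecting the single highest-return sample instead would be a small change. In a closed loop, only `first_action` would be sent to the robot before the search is repeated from the next observation.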