Programmatic Imitation Learning from Unlabeled and Noisy Demonstrations

Jimmy Xin; Linus Zheng; Kia Rahmani; Jiayi Wei; Jarrett Holtz; Isil Dillig; Joydeep Biswas

TL;DR

PLUNDER tackles imitation learning from unlabeled and noisy demonstrations by learning probabilistic programmatic policies, called action selection policies (ASPs), within an Expectation-Maximization (EM) framework. It uses a probabilistic DSL to represent transitions between high-level actions, a particle-filtered E-step to infer plausible action sequences, and a bottom-up inductive M-step with a complexity-penalized prior to synthesize refined policies. The approach yields high alignment with demonstrations (≈95% action-label accuracy) and strong task success (≈90%), while remaining interpretable enough to allow straightforward repair and verification. Across five challenging benchmarks, PLUNDER is robust to noise and data-efficient, offering a practical, human-readable alternative to opaque neural policies with competitive performance.

Abstract

Imitation Learning (IL) is a promising paradigm for teaching robots to perform novel tasks using demonstrations. Most existing approaches for IL use neural networks (NN); however, these methods suffer from several well-known limitations: they 1) require large amounts of training data, 2) are hard to interpret, and 3) are hard to repair and adapt. There is an emerging interest in programmatic imitation learning (PIL), which offers significant promise in addressing these limitations. In PIL, the learned policy is represented in a programming language, making it amenable to interpretation and repair. However, state-of-the-art PIL algorithms assume access to action labels and struggle to learn from noisy real-world demonstrations. In this paper, we propose PLUNDER, a novel PIL algorithm that integrates a probabilistic program synthesizer in an iterative Expectation-Maximization (EM) framework to address these shortcomings. Unlike existing PIL approaches, PLUNDER synthesizes probabilistic programmatic policies that are particularly well-suited for modeling the uncertainties inherent in real-world demonstrations. Our approach leverages an EM loop to simultaneously infer the missing action labels and the most likely probabilistic policy. We benchmark PLUNDER against several established IL techniques and demonstrate its superiority across five challenging imitation learning tasks under noise. PLUNDER policies achieve 95% accuracy in matching the given demonstrations, outperforming the next best baseline by 19%. Additionally, policies generated by PLUNDER successfully complete the tasks 17% more frequently than the nearest baseline.
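
To make the EM loop concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a two-action task (ACC, then DEC, with DEC absorbing), a logistic transition condition on velocity, and a grid search over a single threshold as a stand-in for the bottom-up program synthesizer. The complexity-penalized prior is omitted. All names (`p_switch`, `e_step`, `m_step`, `em_sketch`) and all constants are hypothetical.

```python
import math
import random

ACC, DEC = "ACC", "DEC"
ACCEL = {ACC: 4.0, DEC: -4.0}  # hypothetical acceleration emitted under each label
OBS_STD = 1.0                  # assumed observation-noise standard deviation


def p_switch(v, v0, k=2.0):
    """P(ACC -> DEC | velocity v): logistic in the margin v - v0 (assumed form)."""
    return 1.0 / (1.0 + math.exp(-k * (v - v0)))


def obs_likelihood(label, a_obs):
    """Unnormalized Gaussian likelihood of an observed acceleration."""
    z = (a_obs - ACCEL[label]) / OBS_STD
    return math.exp(-0.5 * z * z)


def e_step(vels, accs, v0, n_particles, rng):
    """Particle filter: sample label sequences under the current policy."""
    particles = [[ACC] for _ in range(n_particles)]
    for v, a in zip(vels[1:], accs[1:]):
        weights = []
        for p in particles:
            if p[-1] == ACC and rng.random() < p_switch(v, v0):
                p.append(DEC)
            else:
                p.append(p[-1])  # DEC is absorbing in this toy task
            weights.append(obs_likelihood(p[-1], a))
        # Resample in proportion to how well each particle explains the observation.
        chosen = rng.choices(particles, weights=weights, k=n_particles)
        particles = [list(p) for p in chosen]
    return particles


def log_lik(seq, vels, v0):
    """Log-probability of the sampled transitions under threshold v0."""
    total = 0.0
    for prev, cur, v in zip(seq, seq[1:], vels[1:]):
        if prev == ACC:
            p = min(max(p_switch(v, v0), 1e-9), 1.0 - 1e-9)
            total += math.log(p if cur == DEC else 1.0 - p)
    return total


def m_step(particles, vels, candidates):
    """Pick the candidate threshold that best explains the sampled labels."""
    return max(candidates, key=lambda v0: sum(log_lik(s, vels, v0) for s in particles))


def em_sketch(vels, accs, v0_init, candidates, iters=5, n_particles=200, seed=0):
    """Alternate label inference (E) and policy refitting (M)."""
    rng = random.Random(seed)
    v0, particles = v0_init, []
    for _ in range(iters):
        particles = e_step(vels, accs, v0, n_particles, rng)
        v0 = m_step(particles, vels, candidates)
    return v0, particles


# Hand-built, noise-free demonstration: accelerate for 11 steps, then brake.
vels = [0.5 * t for t in range(12)] + [5.5 - 0.5 * t for t in range(1, 10)]
accs = [4.0] * 11 + [-4.0] * 10
candidates = [0.5 * i for i in range(21)]  # thresholds 0.0, 0.5, ..., 10.0
learned_v0, particles = em_sketch(vels, accs, v0_init=2.0, candidates=candidates)
print(learned_v0, particles[0])
```

Starting from a deliberately poor threshold, the E-step discards label sequences that contradict the observed accelerations, and the M-step moves the threshold toward the velocity at which the demonstrator actually switched. In PLUNDER itself the M-step searches over programs in the DSL rather than over a single parameter.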

Paper Structure (18 sections, 9 equations, 8 figures, 1 table, 1 algorithm)

Introduction
Related Work
Problem Formulation
Example: Stop-Sign
The Plunder Algorithm
Probabilistic ASPs
Expectation (E) Step
Maximization (M) Step
Example: Stop-Sign
Experimental Evaluations
Baselines
Alignment with Demonstrations
Task Completion Rate
Convergence
Impact of Noise
...and 3 more sections

Figures (8)

Figure 1: Overview of Plunder
Figure 2: Demonstration trajectories for the Stop Sign task. The acceleration value of this particular vehicle cannot exceed $a_{\max}\approx 13\,\mathrm{m/s^2}$ or drop below $a_{\min}\approx -20\,\mathrm{m/s^2}$.
Figure 3: Grammar of ASPs. Here, $y_t, a_{t-1}$ are inputs representing the current state and previous action, $c$ is a constant, $A$ is an action, and $g$ is a built-in ($+$, $\times$, etc.) or domain-specific feature extraction function (e.g., timeToStop).
Figure 4: Best candidate programs found at each iteration and the corresponding action sequence samples. The ground-truth sequence is shown at the top. Only $\phi_{\mathtt{ACC},\mathtt{CON}}$ and $\phi_{\mathtt{CON},\mathtt{DEC}}$ from each policy are shown due to space constraints.
Figure 5: Accuracy of Action Labels
...and 3 more figures
