Learning What to Do and What Not To Do: Offline Imitation from Expert and Undesirable Demonstrations

Huy Hoang; Tien Mai; Pradeep Varakantham; Tanvi Verma

TL;DR

This work addresses offline imitation learning from datasets containing both expert and undesirable demonstrations. It formulates a difference-of-KL objective $f(d^\pi)=D_{KL}(d^\pi\|d^G)-\alpha D_{KL}(d^\pi\|d^B)$, where $d^G$ and $d^B$ are the state-action visitation distributions of the expert (good) and undesirable (bad) data, and shows that this objective is convex when $\alpha\le1$. Applying Lagrangian duality, the authors derive a tractable, non-adversarial Q-learning objective with a KL-based reference $d^U$ and a correction term $\Psi(s,a)$, along with a surrogate lower bound that is linear in $Q$ and concave in the policy. They introduce a practical Q-weighted BC policy extraction method and estimate occupancy ratios via discriminators, which enables stable training without adversarial components. Empirical results on standard offline IL benchmarks (D4RL) show consistent improvements over baselines across locomotion and manipulation tasks, even as the amount of bad data grows and the balancing parameter $\alpha$ is varied. Overall, the proposed method, ContraDICE, provides a principled, scalable framework for learning from contrasting demonstrations in offline settings and for leveraging negative examples.
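The difference-of-KL objective can be evaluated on a toy problem to get a feel for the role of $\alpha$. The following is a minimal sketch, not the authors' implementation: it assumes small discrete distributions with full support, and the function names (`kl`, `contrastive_objective`, `max_midpoint_violation`) are hypothetical. It measures the largest violation of the midpoint convexity inequality $f\big(\tfrac{p+q}{2}\big) \le \tfrac{1}{2}f(p) + \tfrac{1}{2}f(q)$ over random pairs of distributions.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))


def contrastive_objective(d_pi, d_good, d_bad, alpha):
    """f(d_pi) = KL(d_pi || d_good) - alpha * KL(d_pi || d_bad)."""
    return kl(d_pi, d_good) - alpha * kl(d_pi, d_bad)


def random_distribution(rng, n):
    """Sample a strictly positive distribution over n state-action pairs."""
    # Mixing with the uniform distribution guarantees every entry is > 0.
    return 0.99 * rng.dirichlet(np.ones(n)) + 0.01 / n


def max_midpoint_violation(alpha, n=6, trials=2000, seed=0):
    """Largest value of f(mid) - (f(p) + f(q)) / 2 over random pairs.

    A value <= 0 (up to floating-point error) is consistent with convexity;
    a clearly positive value shows that convexity fails.
    """
    rng = np.random.default_rng(seed)
    d_good = random_distribution(rng, n)  # toy stand-in for d^G
    d_bad = random_distribution(rng, n)   # toy stand-in for d^B
    worst = -np.inf
    for _ in range(trials):
        p = random_distribution(rng, n)
        q = random_distribution(rng, n)
        mid = 0.5 * (p + q)
        f_mid = contrastive_objective(mid, d_good, d_bad, alpha)
        f_avg = 0.5 * (
            contrastive_objective(p, d_good, d_bad, alpha)
            + contrastive_objective(q, d_good, d_bad, alpha)
        )
        worst = max(worst, f_mid - f_avg)
    return worst


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0):
        print(f"alpha={alpha}: max violation = {max_midpoint_violation(alpha):.3e}")
```

On this toy problem the violation should stay at or below zero (up to rounding) for $\alpha \le 1$ and become positive for $\alpha > 1$. A random check of this kind only illustrates the convexity condition; it does not prove it.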

Abstract

Offline imitation learning typically learns from expert and unlabeled demonstrations, yet often overlooks the valuable signal in explicitly undesirable behaviors. In this work, we study offline imitation learning from contrasting behaviors, where the dataset contains both expert and undesirable demonstrations. We propose a novel formulation that optimizes a difference of KL divergences over the state-action visitation distributions of expert and undesirable (or bad) data. Although the resulting objective is a DC (Difference-of-Convex) program, we prove that it becomes convex when expert demonstrations outweigh undesirable demonstrations, enabling a practical and stable non-adversarial training objective. Our method avoids adversarial training and handles both positive and negative demonstrations in a unified framework. Extensive experiments on standard offline imitation learning benchmarks demonstrate that our approach consistently outperforms state-of-the-art baselines.
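One way to see why the difference-of-convex objective can become convex is to expand both KL terms. The expansion below is an illustrative sketch, not the paper's proof. It assumes discrete visitation distributions with full support and writes $d^\pi$, $d^G$, $d^B$ for the policy, expert, and undesirable visitation distributions, with $\alpha$ the weight on the undesirable term:

$$
\begin{aligned}
f(d^\pi) &= \sum_{s,a} d^\pi(s,a)\log\frac{d^\pi(s,a)}{d^G(s,a)} \;-\; \alpha\sum_{s,a} d^\pi(s,a)\log\frac{d^\pi(s,a)}{d^B(s,a)} \\
&= (1-\alpha)\sum_{s,a} d^\pi(s,a)\log d^\pi(s,a) \;-\; \sum_{s,a} d^\pi(s,a)\,\bigl[\log d^G(s,a) - \alpha\log d^B(s,a)\bigr].
\end{aligned}
$$

The second sum is linear in $d^\pi$. The first is a negative-entropy term, which is convex, scaled by $(1-\alpha)$. The whole expression is therefore convex whenever $\alpha \le 1$, that is, when the expert term carries at least as much weight as the undesirable term.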
