Offline Learning from Demonstrations and Unlabeled Experience
Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, Scott Reed
TL;DR
Offline Reinforced Imitation Learning (ORIL) learns offline from a small set of demonstrations plus a large pool of unlabeled, mixed-quality trajectories that carry no reward annotations. It trains a reward model by contrasting expert data with unlabeled data, using positive-unlabeled (PU) learning and TRAIL-inspired regularization, annotates all data with the learned reward, and then trains an offline RL agent with Critic-Regularized Regression. Across Robotic Manipulation and DeepMind Control Suite tasks, ORIL consistently outperforms standard behavior cloning baselines and approaches the performance of methods that have access to ground-truth rewards. It is robust to the quality of the unlabeled data and improves as more unlabeled data is added. The approach reduces reliance on reward engineering and online interaction, which makes data-driven offline robot learning more practical.
Abstract
Behavior cloning (BC) is often practical for robot learning because it allows a policy to be trained offline without rewards, by supervised learning on expert demonstrations. However, BC does not effectively leverage what we will refer to as unlabeled experience: data of mixed and unknown quality without reward annotations. This unlabeled data can be generated by a variety of sources such as human teleoperation, scripted policies and other agents on the same robot. Towards data-driven offline robot learning that can use this unlabeled experience, we introduce Offline Reinforced Imitation Learning (ORIL). ORIL first learns a reward function by contrasting observations from demonstrator and unlabeled trajectories, then annotates all data with the learned reward, and finally trains an agent via offline reinforcement learning. Across a diverse set of continuous control and simulated robotic manipulation tasks, we show that ORIL consistently outperforms comparable BC agents by effectively leveraging unlabeled experience.
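The first two steps of the pipeline described above (learning a reward by contrasting demonstrator and unlabeled observations, then annotating data with it) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a reward model whose outputs lie in (0, 1), uses a non-negative positive-unlabeled loss as one plausible form of the contrastive objective, and treats `eta` (the assumed fraction of expert-like data in the unlabeled pool) as a hypothetical hyperparameter. The function names and the linear reward model are also assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Map real-valued scores into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def pu_reward_loss(r_expert, r_unlabeled, eta=0.5, eps=1e-8):
    """Positive-unlabeled loss on reward-model outputs in (0, 1).

    r_expert:    rewards predicted for demonstrator observations (positives).
    r_unlabeled: rewards predicted for unlabeled observations.
    eta:         assumed share of positives hidden in the unlabeled pool.
    """
    # Cost of treating expert observations as positives.
    pos_as_pos = -np.log(r_expert + eps).mean()
    # Cost of treating expert observations as negatives.
    pos_as_neg = -np.log(1.0 - r_expert + eps).mean()
    # Cost of treating unlabeled observations as negatives.
    unl_as_neg = -np.log(1.0 - r_unlabeled + eps).mean()
    # Estimated risk on true negatives, clamped at zero (assumed variant).
    neg_risk = max(unl_as_neg - eta * pos_as_neg, 0.0)
    return eta * pos_as_pos + neg_risk


def annotate(observations, weights, bias):
    """Label observations with a hypothetical linear-logistic reward model."""
    return sigmoid(observations @ weights + bias)
```

A reward model that scores expert observations high and unlabeled ones low should incur a smaller loss than one that does the opposite. The annotated rewards would then be passed to an offline RL learner (Critic-Regularized Regression in the paper), which is omitted from this sketch.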
