RAIL: Reachability-Aided Imitation Learning for Safe Policy Execution
Wonsuhk Jung, Dennis Anthony, Utkarsh A. Mishra, Nadun Ranawaka Arachchige, Matthew Bronars, Danfei Xu, Shreyas Kousik
TL;DR
RAIL addresses the problem of enforcing hard safety constraints in imitation learning for robotics by overlaying a reachability-based safety filter on top of state-of-the-art IL policies. It couples offline IL with a receding-horizon safety wrapper consisting of a head-plan verifier and a low-dimensional backup planner, and employs continuous-time collision checking to guarantee safety. The core technical contribution is overapproximating the robot's continuous swept volume via polynomial zonotopes to enable fast, provable collision checks during planning. Empirical results across simulation and real hardware show that RAIL achieves 0% collision rates while maintaining or sometimes improving task success, demonstrating that hard safety constraints can enhance performance and are feasible in real time (about $0.42 \pm 0.05$ s per plan). The framework is compatible with diffusion-based IL and other offline IL methods, and opens avenues for integrating safety directly into IL training to reduce planning conflicts.
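The receding-horizon wrapper described above can be illustrated with a toy sketch. This is not the RAIL implementation: it assumes a 2D point robot moving along straight segments and axis-aligned box obstacles, and it replaces the polynomial-zonotope swept-volume check with an exact segment-versus-box test. The names `segment_hits_box`, `plan_is_safe`, `braking_backup`, and `filter_step` are hypothetical and chosen only for illustration. The sketch shows the commit rule: a proposed head plan is accepted only if both it and a backup maneuver starting from its end are verified collision-free over continuous time; otherwise the previously verified backup is executed.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def segment_hits_box(p: Point, q: Point, box: Box) -> bool:
    """Continuous-time check: does the straight segment p->q touch the box?

    Slab method; a toy stand-in for a reachability-based collision check.
    Touching the boundary counts as a collision (conservative).
    """
    t0, t1 = 0.0, 1.0
    axes = ((p[0], q[0] - p[0], box[0], box[2]),
            (p[1], q[1] - p[1], box[1], box[3]))
    for start, delta, lo, hi in axes:
        if delta == 0.0:
            if start < lo or start > hi:
                return False
        else:
            ta, tb = (lo - start) / delta, (hi - start) / delta
            if ta > tb:
                ta, tb = tb, ta
            t0, t1 = max(t0, ta), min(t1, tb)
            if t0 > t1:
                return False
    return True


def plan_is_safe(plan: Sequence[Point], obstacles: Sequence[Box]) -> bool:
    """A plan is safe if no segment between consecutive waypoints hits an obstacle."""
    return not any(
        segment_hits_box(a, b, box)
        for a, b in zip(plan, plan[1:])
        for box in obstacles
    )


def braking_backup(head_plan: Sequence[Point]) -> List[Point]:
    """Hypothetical backup: coast half of the last step, then stop."""
    (x0, y0), (x1, y1) = head_plan[-2], head_plan[-1]
    return [(x1, y1), (x1 + 0.5 * (x1 - x0), y1 + 0.5 * (y1 - y0))]


def filter_step(head_plan, committed_backup, obstacles):
    """Return (plan_to_execute, backup_to_keep) for one planning iteration."""
    candidate = braking_backup(head_plan)
    if plan_is_safe(head_plan, obstacles) and plan_is_safe(candidate, obstacles):
        return list(head_plan), candidate  # commit the policy's head plan
    # Fall back to the last verified backup; afterwards the robot stays stopped.
    stop = committed_backup[-1]
    return list(committed_backup), [stop, stop]


obstacles = [(1.0, -0.5, 2.0, 0.5)]
committed = [(0.0, 0.0), (0.0, 0.0)]  # robot starts at rest

slow_plan, slow_backup = filter_step([(0.0, 0.0), (0.2, 0.0)], committed, obstacles)
fast_plan, fast_backup = filter_step([(0.0, 0.0), (0.9, 0.0)], committed, obstacles)
```

In this toy setup the faster head plan is itself collision-free, but it is rejected because no safe braking maneuver follows it, which is why the head plan and the backup are verified together. Checking whole segments rather than sampled waypoints also catches a step that would jump across an obstacle.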
Abstract
Imitation learning (IL) has shown great success in learning complex robot manipulation tasks. However, there remains a need for practical safety methods to justify widespread deployment. In particular, it is important to certify that a system obeys hard constraints on unsafe behavior in settings where it is unacceptable to trade off performance against safety by tuning the policy (i.e., soft constraints). This leads to the question: how does enforcing hard constraints impact the performance (meaning safely completing tasks) of an IL policy? To answer this question, this paper builds a reachability-based safety filter to enforce hard constraints on IL, which we call Reachability-Aided Imitation Learning (RAIL). Through evaluations with state-of-the-art IL policies on mobile robot and manipulation tasks, we make two key findings. First, the highest-performing policies sometimes perform well only because they frequently violate constraints, and they lose significant performance under hard constraints. Second, surprisingly, hard constraints on the lower-performing policies can occasionally increase their ability to perform tasks safely. Finally, hardware evaluation confirms that the method can operate in real time.
