OSIL: Learning Offline Safe Imitation Policies with Safety Inferred from Non-preferred Trajectories
Returaj Burnwal, Nirav Pravinbhai Bhatt, Balaraman Ravindran
TL;DR
This paper introduces OSIL, an offline safe imitation-learning algorithm that infers safety from a small set of non-preferred trajectories and a large set of high-return union trajectories. It casts safe policy learning as a CMDP and derives a computable lower bound on the reward objective using a learned cost model, trained via contrastive and preference-based losses on offline data. The policy is learned through a Lagrangian-regularized objective that blends behavior cloning from the union data with a cost-critic term, using an adaptive penalty to enforce safety while preserving performance. Empirically, OSIL yields safer policies with competitive returns across six MuJoCo and navigation tasks, outperforming several offline baselines and approaching constrained-RL performance under partial information. The work highlights practical offline safety learning from undesirable demonstrations, while noting the assumption that the union data sufficiently covers high-return, varying-cost behaviors as a key limitation and direction for future work.
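The two ingredients named above, a preference-style cost-model loss and a Lagrangian objective with an adaptive penalty, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`preference_loss`, `policy_loss`, `update_multiplier`), the Bradley-Terry form of the preference loss, and the step size `lr` are all assumptions made for illustration, and the contrastive term is omitted.

```python
import numpy as np

def preference_loss(cost_nonpref, cost_union):
    """Bradley-Terry-style loss (assumed form, not taken from the paper).

    Takes per-timestep predicted costs for one non-preferred trajectory and
    one union trajectory. The loss is small when the non-preferred trajectory
    receives the larger total cost.
    """
    diff = np.sum(cost_nonpref) - np.sum(cost_union)
    # -log(sigmoid(diff)), written with logaddexp for numerical stability
    return float(np.logaddexp(0.0, -diff))

def policy_loss(bc_loss, cost_value, lam):
    """Lagrangian-regularized loss: behavior cloning plus a weighted cost-critic term."""
    return bc_loss + lam * cost_value

def update_multiplier(lam, cost_value, cost_limit, lr=0.1):
    """Projected gradient ascent on the multiplier (hypothetical update rule).

    The penalty grows while the estimated cost exceeds the limit, shrinks
    otherwise, and is clipped at zero.
    """
    return max(0.0, lam + lr * (cost_value - cost_limit))

# Usage: the penalty rises while the estimated cost stays above the limit.
lam = 0.0
for estimated_cost in [3.0, 2.5, 2.0]:
    lam = update_multiplier(lam, estimated_cost, cost_limit=1.0, lr=0.5)
```

In this sketch the multiplier is the only adaptive quantity: a larger `lam` shifts the policy loss toward cost avoidance, and a multiplier of zero reduces it to plain behavior cloning.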
Abstract
This work addresses the problem of offline safe imitation learning (IL), where the goal is to learn safe and reward-maximizing policies from demonstrations that do not have per-timestep safety cost or reward information. In many real-world domains, online learning in the environment can be risky, and specifying accurate safety costs can be difficult. However, it is often feasible to collect trajectories that reflect undesirable or unsafe behavior, implicitly conveying what the agent should avoid. We refer to these as non-preferred trajectories. We propose a novel offline safe IL algorithm, OSIL, that infers safety from non-preferred demonstrations. We formulate safe policy learning as a Constrained Markov Decision Process (CMDP). Instead of relying on explicit safety cost and reward annotations, OSIL reformulates the CMDP problem by deriving a lower bound on the reward-maximizing objective and learning a cost model that estimates the likelihood of non-preferred behavior. Our approach allows agents to learn safe and reward-maximizing behavior entirely from offline demonstrations. We empirically demonstrate that our approach can learn safer policies that satisfy cost constraints without degrading reward performance, thus outperforming several baselines.
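For orientation, the standard CMDP problem that the abstract refers to can be written as follows. The notation here ($r$, $c$, $\gamma$, cost threshold $\kappa$, multiplier $\lambda$) is the usual textbook convention and is an assumption, not necessarily the paper's own; OSIL does not observe $r$ or $c$ and instead works with a lower bound on the reward objective and a learned cost model.

```latex
\max_{\pi} \; J_r(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
J_c(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le \kappa

% Lagrangian relaxation with multiplier \lambda \ge 0
\min_{\lambda \ge 0} \; \max_{\pi} \; J_r(\pi) - \lambda \left( J_c(\pi) - \kappa \right)
```

The relaxation makes the role of the penalty explicit: $\lambda$ increases when the expected cost exceeds $\kappa$ and decays toward zero when the constraint is satisfied.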
