OSIL: Learning Offline Safe Imitation Policies with Safety Inferred from Non-preferred Trajectories

Returaj Burnwal, Nirav Pravinbhai Bhatt, Balaraman Ravindran

TL;DR

This paper introduces OSIL, an offline safe imitation-learning algorithm that infers safety from a small set of non-preferred trajectories and a large set of high-return union trajectories. It casts safe policy learning as a CMDP and derives a computable lower bound on the reward objective using a learned cost model, trained via contrastive and preference-based losses on offline data. The policy is learned through a Lagrangian-regularized objective that blends behavior cloning from the union data with a cost-critic term, using an adaptive penalty to enforce safety while preserving performance. Empirically, OSIL yields safer policies with competitive returns across six MuJoCo and navigation tasks, outperforming several offline baselines and approaching constrained-RL performance under partial information. The work highlights practical offline safety learning from undesirable demonstrations, while noting the assumption that the union data sufficiently covers high-return, varying-cost behaviors as a key limitation and direction for future work.
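
The adaptive penalty mentioned above can be illustrated with a small sketch. This summary does not give the paper's exact update rule, so the code below assumes the standard projected dual-ascent step used in Lagrangian methods for CMDPs. The names `policy_loss`, `update_multiplier`, `cost_limit`, and `lr` are hypothetical, and the scalar inputs stand in for the behavior-cloning loss and the cost-critic estimate.

```python
def policy_loss(bc_loss, cost_value, lam):
    """Lagrangian-style objective: imitation term plus a weighted cost term.

    bc_loss    -- behavior-cloning loss on the union data (placeholder scalar)
    cost_value -- cost-critic estimate for the current policy (placeholder scalar)
    lam        -- non-negative penalty weight (Lagrange multiplier)
    """
    return bc_loss + lam * cost_value


def update_multiplier(lam, cost_value, cost_limit, lr=0.1):
    """Projected dual ascent (an assumption, not the paper's stated rule).

    The multiplier grows while the estimated cost exceeds the limit,
    shrinks when it falls below, and is clipped at zero.
    """
    return max(0.0, lam + lr * (cost_value - cost_limit))


# Toy run: the estimated cost drops over training, so the penalty first
# rises and then starts to relax.
lam = 0.0
history = []
for estimated_cost in [3.0, 2.5, 1.0, 0.5]:
    lam = update_multiplier(lam, estimated_cost, cost_limit=1.0)
    history.append(lam)
```

The clipping at zero keeps the penalty from rewarding cost, which is why the policy objective reduces to plain behavior cloning once the constraint is comfortably satisfied.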

Abstract

This work addresses the problem of offline safe imitation learning (IL), where the goal is to learn safe and reward-maximizing policies from demonstrations that do not have per-timestep safety cost or reward information. In many real-world domains, online learning in the environment can be risky, and specifying accurate safety costs can be difficult. However, it is often feasible to collect trajectories that reflect undesirable or unsafe behavior, implicitly conveying what the agent should avoid. We refer to these as non-preferred trajectories. We propose a novel offline safe IL algorithm, OSIL, that infers safety from non-preferred demonstrations. We formulate safe policy learning as a Constrained Markov Decision Process (CMDP). Instead of relying on explicit safety cost and reward annotations, OSIL reformulates the CMDP problem by deriving a lower bound on the reward-maximizing objective and learning a cost model that estimates the likelihood of non-preferred behavior. Our approach allows agents to learn safe and reward-maximizing behavior entirely from offline demonstrations. We empirically demonstrate that our approach can learn safer policies that satisfy cost constraints without degrading the reward performance, thus outperforming several baselines.

Paper Structure (36 sections, 5 theorems, 28 equations, 20 figures, 2 tables, 1 algorithm)

This paper contains 36 sections, 5 theorems, 28 equations, 20 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Let $\pi_U$ denote the policy that generated the union dataset. Define $\epsilon = \max_{s,a} |Q^\pi_r(s,a) - V^\pi_r(s)|$ and $D^{\text{max}}_{\text{KL}}(\pi_U,\pi) = \max_s D_{\text{KL}}(\pi_U(\cdot|s) \,\|\, \pi(\cdot|s))$. Then the following bound holds:

Figures (20)

  • Figure 1: (a) PPL works well when the union dataset consists of low-cost trajectories, and SafeDICE requires that the union dataset contains low-cost or high-cost trajectories. These assumptions are often unrealistic in practice, since trajectories collected in the real world typically span a spectrum of safety costs. OSIL does not make any assumptions about the safety cost of a trajectory in the union dataset, which may contain trajectories with varying levels of safety cost. (b) Under this setting, we observe that OSIL learns a high-return, safer (i.e., low-cost) policy, outperforming the best baseline by nearly 2.8x. We report the mean performance of the algorithm after 1 million training steps, aggregated across both velocity-constrained and navigation tasks. Mean and 95% CIs over 5 seeds.
  • Figure 2: Overview of the cost learning model. $f$ and $g$ are a learnable encoder and a linear model, respectively. The cost model is trained by minimizing two loss functions: $\mathcal{L}_{\text{cost}}^{\text{const}}$ encourages temporally adjacent state-action pairs within a trajectory to remain close in the learned representation space, and $\mathcal{L}_{\text{cost}}^{\text{pref}}$ ensures that the discounted cost of the trajectory $\tau_N$ is greater than that of the trajectory $\tau_U$.
  • Figure 3: Performance Comparison. Experimental results on the Walker2d-Velocity, Swimmer-Velocity, and Ant-Velocity tasks. The shaded area represents the standard error. In velocity-constrained tasks, our method is able to recover safer policies without compromising reward performance.
  • Figure 4: Performance Comparison. Experimental results on the Point-Circle2, Point-Goal1, and Point-Button1 tasks. The shaded area represents the standard error. Similar to the results in Figure 3, our method is able to recover a safer policy than the other baselines.
  • Figure 5: Impact of Non-Preferred Trajectory Dataset Size. Experimental results on Point-Circle2 with varying non-preferred trajectory dataset size $|\mathcal{D}_N| \in \{5, 10, 20, 50\}$. We observe that performance gradually decreases with smaller $|\mathcal{D}_N|$. However, our approach consistently outperforms all baselines across all dataset sizes.
  • ...and 15 more figures
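
The preference term described in the Figure 2 caption can be sketched as follows. The caption only states that the discounted cost of $\tau_N$ should exceed that of $\tau_U$. The Bradley-Terry (logistic) form used here is an assumption, as are the function names and the discount value. Per-step costs are passed in as plain lists standing in for the outputs of the learned model $g(f(s,a))$.

```python
import math


def discounted_cost(step_costs, gamma=0.99):
    """Discounted sum of per-step costs along one trajectory."""
    return sum((gamma ** t) * c for t, c in enumerate(step_costs))


def preference_loss(costs_nonpreferred, costs_union, gamma=0.99):
    """Logistic preference loss (assumed form, not confirmed by the source).

    Models P(tau_N is costlier than tau_U) = sigmoid(C(tau_N) - C(tau_U))
    and returns its negative log-likelihood. The loss is small when the
    non-preferred trajectory is assigned the higher discounted cost.
    """
    diff = discounted_cost(costs_nonpreferred, gamma) - discounted_cost(costs_union, gamma)
    # -log(sigmoid(diff)) = log(1 + exp(-diff)), written in a numerically stable way.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Toy per-step costs: the non-preferred trajectory is costlier at every step.
loss_correct_order = preference_loss([1.0, 1.0, 1.0], [0.1, 0.1, 0.1])
loss_wrong_order = preference_loss([0.1, 0.1, 0.1], [1.0, 1.0, 1.0])
```

Minimizing this loss over pairs drawn from the non-preferred and union datasets pushes the cost model to rank non-preferred trajectories as costlier, without requiring any ground-truth cost labels.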

Theorems & Definitions (8)

  • Definition 1: Non-preferred Trajectory Dataset
  • Definition 2: Union Trajectory Dataset
  • Theorem 1
  • Lemma 1: approx_rl
  • Definition 3
  • Lemma 2: trpo_icml_2015
  • Lemma 3: pollard2000asymptopia
  • Theorem 1