FOSP: Fine-tuning Offline Safe Policy through World Models

Chenyang Cao; Yucheng Xin; Silang Wu; Longxiang He; Zichen Yan; Junbo Tan; Xueqian Wang

TL;DR

The paper tackles safe generalization in vision-based robotic control by addressing the limitations of offline RL when facing unseen environments. It introduces FOSP, a model-based framework that pretrains a world-model-enabled agent offline with in-sample optimization, uses a reachability estimation function to enforce hard safety constraints, and applies safe policy expansion for online finetuning to avoid catastrophic degradation. The approach formulates safety as a CMDP with rewards $J^\mathcal{R}$ and costs $J^\mathcal{C}_i$, employing an Augmented Lagrangian and reachability-guided objective to derive a near-optimal policy under safety guarantees, including a closed-form update for $\pi_\psi$. Empirical results on five Safety-Gymnasium tasks and real-world SafeReach demonstrate strong offline performance, safe and efficient online adaptation, and superior generalization to unseen safety regions, highlighting practical value for rapid, safe deployment of vision-based robots. Overall, FOSP provides a principled offline-to-online safe RL pathway by unifying world-model-based planning, in-sample optimization, hard safety via REF, and safe policy expansion across offline and online phases, enabling few-shot safe adaptation without sim-to-real transfer.
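
The constrained problem summarized above can be written out in generic form. What follows is a sketch of the standard CMDP objective and a textbook augmented Lagrangian for inequality constraints, not a reproduction of the paper's own equations. The cost thresholds $b_i$, the multipliers $\lambda_i \ge 0$, the penalty coefficient $\mu > 0$, and the number of constraints $m$ are assumed notation that does not appear in this summary.

$$
\max_{\pi}\; J^\mathcal{R}(\pi) \quad \text{s.t.} \quad J^\mathcal{C}_i(\pi) \le b_i, \qquad i = 1, \dots, m
$$

$$
\mathcal{L}_\mu(\pi, \lambda) = -J^\mathcal{R}(\pi) + \sum_{i=1}^{m} \frac{1}{2\mu}\Big[\max\big(0,\; \lambda_i + \mu\,(J^\mathcal{C}_i(\pi) - b_i)\big)^2 - \lambda_i^2\Big]
$$

$$
\lambda_i \leftarrow \max\big(0,\; \lambda_i + \mu\,(J^\mathcal{C}_i(\pi) - b_i)\big)
$$

Under this form the policy minimizes $\mathcal{L}_\mu$ while the multipliers grow only when a constraint is violated, so a satisfied constraint with $\lambda_i = 0$ contributes no penalty.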

Abstract

Offline Safe Reinforcement Learning (RL) seeks to address safety constraints by learning from static datasets and restricting exploration. However, these approaches rely heavily on the dataset and struggle to generalize safely to unseen scenarios. In this paper, we aim to improve safety during the deployment of vision-based robotic tasks by fine-tuning an offline pretrained policy online. To facilitate effective fine-tuning, we adopt model-based RL, which is known for its data efficiency. Specifically, our method employs in-sample optimization to improve offline training efficiency while incorporating reachability guidance to ensure safety. After an offline safe policy is obtained, a safe policy expansion approach is used for online fine-tuning. The performance of our method is validated on simulation benchmarks with five vision-only tasks and through real-world robot deployment using limited data. The results demonstrate that our approach significantly improves the generalization of offline policies to unseen safety-constrained scenarios. To the best of our knowledge, this is the first work to explore offline-to-online RL for safe generalization tasks.

Paper Structure (63 sections, 3 theorems, 50 equations, 23 figures, 5 tables, 1 algorithm)

Introduction
Related Works
Offline-to-online RL
Safe RL
Preliminaries
World Model
Safe Model-based RL
Reachability Estimation Function
Methods
In-sample Optimization for Offline Training
Reachability Estimation Function as Safety Guarantee
Safe Policy Expansion for World Models Fine-tuning
Experimental Results
Experimental Setup
Simulation Tasks
...and 48 more sections

Key Result

Proposition 1

The optimization objective of equation (eq:feasible) is a necessary condition for $\max_{\pi}\mathbb{E}_{{\bm{s}}}[A^r({\bm{s}},{\bm{a}})\cdot \mathds{1}\{{\bm{s}}\in \mathcal{S}_f\}]$.

Figures (23)

Figure 1: Fine-tuning offline safe policy through world models. We propose a framework for offline pretraining and online fine-tuning of the world model. We first pretrain the agent on the offline dataset and on rollouts generated by the world model. The grey section depicts the architecture of the world model: it first encodes an image observation into its latent state $s_0$; then, for each latent state, it generates an action using the policy and predicts the reward, cost, and next state. In the offline-to-online phase, we employ policy expansion to initialize a new policy for online fine-tuning. The pretrained Q-value is used to construct a softmax probability distribution. We then sample an action from this distribution so that the agent can interact safely with the real world and generalize to novel tasks.
Figure 2: Safety insurance in FOSP. We enable a safe policy to predict the probability of future constraint violations. The policy maintains persistent safety inside the feasible set and, when in the infeasible set, re-enters the feasible set as quickly as possible.
Figure 3: Online experimental results. Comparison of FOSP with baselines across five image-based safety tasks at the online fine-tuning stage. The results for model-based algorithms are obtained after fine-tuning for 750,000 steps. The dashed lines represent the benchmark results for CPO and PPO-Lagrangian after 10 million training steps across all tasks. SafeDreamer (planning) was trained online for 750,000 steps. Reward: average episode reward return. Cost: average episode cost return. Cost Regret: average cost value throughout the training phase.
Figure 3: Real-world unseen tasks. We record the success rate (SR, %) and the constraint violation rate (CV, %) over 20 tests in three tasks; a test is labeled a violation if the robot collides with an obstacle. The robot was fine-tuned for 40 gradient steps. See the appendix (apx:real) for more details.
Figure 4: Module ablation studies. We evaluate ablations in SafetyPointGoal2 and report the mean over five seeds. The vertical line divides the offline and online phases.
...and 18 more figures
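
The action selection described in the Figure 1 caption (a softmax distribution built from Q-values, then sampling an action from it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_action`, the `temperature` parameter, and the idea of passing one candidate action per policy are hypothetical, and how FOSP combines reward and cost values into the score is not specified in this summary.

```python
import numpy as np

def select_action(q_values, candidate_actions, temperature=1.0, rng=None):
    """Sample one candidate action with probability softmax(Q / temperature).

    q_values: one score per candidate (assumed to come from a pretrained critic).
    candidate_actions: e.g. [offline_policy_action, online_policy_action].
    """
    rng = np.random.default_rng() if rng is None else rng
    q = np.asarray(q_values, dtype=np.float64)
    # Subtract the maximum before exponentiating for numerical stability.
    logits = (q - q.max()) / temperature
    probs = np.exp(logits)
    probs /= probs.sum()
    idx = rng.choice(len(probs), p=probs)
    return candidate_actions[idx], probs

# Hypothetical usage: two candidates with illustrative Q-values.
action, probs = select_action([0.2, 1.3], ["offline_action", "online_action"])
```

A lower temperature makes the choice closer to greedy selection of the highest-valued candidate, while a higher temperature keeps both policies in play during fine-tuning.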

Theorems & Definitions (3)

Proposition 1
Proposition 2
Proposition 3
