FOSP: Fine-tuning Offline Safe Policy through World Models
Chenyang Cao, Yucheng Xin, Silang Wu, Longxiang He, Zichen Yan, Junbo Tan, Xueqian Wang
TL;DR
The paper tackles safe generalization in vision-based robotic control, where policies trained purely offline struggle in unseen environments. It introduces FOSP, a model-based framework that pretrains a world-model-based agent offline with in-sample optimization, uses a reachability estimation function (REF) to enforce hard safety constraints, and applies safe policy expansion during online fine-tuning to avoid catastrophic performance degradation. Safety is formulated as a constrained Markov decision process (CMDP) with rewards $J^\mathcal{R}$ and costs $J^\mathcal{C}_i$; an Augmented Lagrangian combined with a reachability-guided objective yields a near-optimal policy under the safety constraints, including a closed-form update for the policy $\pi_\psi$. Experiments on five Safety-Gymnasium tasks and a real-world SafeReach task show strong offline performance, safe and efficient online adaptation, and improved generalization to unseen safety regions. Overall, FOSP offers a principled offline-to-online safe RL pipeline that supports rapid, few-shot safe adaptation of vision-based robots without sim-to-real transfer.
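For reference, the constrained problem mentioned above can be written in the textbook CMDP form. This is a sketch under assumed notation, not necessarily the paper's exact objective: the number of constraints $m$, the cost thresholds $d_i$, the multipliers $\lambda_i \ge 0$, and the penalty coefficient $\mu > 0$ are introduced here only for illustration.

$$
\max_{\pi} \; J^\mathcal{R}(\pi) \quad \text{s.t.} \quad J^\mathcal{C}_i(\pi) \le d_i, \quad i = 1, \dots, m
$$

A standard augmented Lagrangian for such inequality constraints is minimized over $\pi$:

$$
\mathcal{L}_\mu(\pi, \lambda) = -J^\mathcal{R}(\pi) + \frac{1}{2\mu} \sum_{i=1}^{m} \left( \max\{0,\; \lambda_i + \mu\,(J^\mathcal{C}_i(\pi) - d_i)\}^2 - \lambda_i^2 \right)
$$

with the multiplier update

$$
\lambda_i \leftarrow \max\{0,\; \lambda_i + \mu\,(J^\mathcal{C}_i(\pi) - d_i)\}
$$

Under this update, $\lambda_i$ grows while constraint $i$ is violated and shrinks toward zero once it is satisfied.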
Abstract
Offline Safe Reinforcement Learning (RL) seeks to address safety constraints by learning from static datasets and restricting exploration. However, these approaches rely heavily on the dataset and struggle to generalize safely to unseen scenarios. In this paper, we aim to improve safety during the deployment of vision-based robotic tasks by fine-tuning an offline pretrained policy online. To facilitate effective fine-tuning, we adopt model-based RL, which is known for its data efficiency. Specifically, our method employs in-sample optimization to improve offline training efficiency while incorporating reachability guidance to ensure safety. After an offline safe policy is obtained, a safe policy expansion approach is used for online fine-tuning. The performance of our method is validated on simulation benchmarks with five vision-only tasks and through real-world robot deployment using limited data. The results demonstrate that our approach significantly improves the generalization of offline policies to unseen safety-constrained scenarios. To the best of our knowledge, this is the first work to explore offline-to-online RL for safe generalization tasks.
