Constraint-Adaptive Policy Switching for Offline Safe Reinforcement Learning

Yassine Chemingui; Aryan Deshwal; Honghao Wei; Alan Fern; Janardhan Rao Doppa

TL;DR

This paper tackles offline safe reinforcement learning (OSRL) with safety constraints κ that vary at test time. It proposes Constraint-Adaptive Policy Switching (CAPS), a wrapper that trains multiple policies with a shared representation to span different reward-cost trade-offs and switches between them at deployment using a two-step decision rule: first filter out unsafe actions, then select the safe action with the highest expected reward. The authors provide a reduction-based training procedure that combines two offline RL runs with multiple policy extractions, and instantiate CAPS with both Implicit Q-Learning (IQL) and SAC+BC, including a shared-actor architecture to improve transfer. Theoretical safety guarantees are established under bounded optimal-cost variation, and experiments on 38 DSRL tasks show that CAPS consistently outperforms baselines, with two-policy CAPS providing a robust and efficient baseline. Overall, CAPS offers a practical, scalable approach to adapting OSRL to varying deployment constraints.

Abstract

Offline safe reinforcement learning (OSRL) involves learning a decision-making policy from a fixed batch of training data that maximizes rewards while satisfying pre-defined safety constraints. However, adapting to varying safety constraints during deployment without retraining remains an under-explored challenge. To address this challenge, we introduce constraint-adaptive policy switching (CAPS), a wrapper framework around existing offline RL algorithms. During training, CAPS uses offline data to learn multiple policies with a shared representation that optimize different reward and cost trade-offs. During testing, CAPS switches between those policies by selecting, at each state, the policy that maximizes future rewards among those that satisfy the current cost constraint. Our experiments on 38 tasks from the DSRL benchmark demonstrate that CAPS consistently outperforms existing methods, establishing a strong wrapper-based baseline for OSRL. The code is publicly available at https://github.com/yassineCh/CAPS.
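
The test-time switching step described above can be illustrated with a minimal sketch. It is not taken from the CAPS repository. The function name `select_policy`, its arguments, and the fallback used when no candidate is safe are all assumptions made for illustration. The per-policy reward and cost estimates are assumed to come from learned value functions evaluated at the current state.

```python
from typing import Sequence


def select_policy(
    reward_values: Sequence[float],
    cost_values: Sequence[float],
    cost_budget: float,
) -> int:
    """Return the index of the candidate policy to follow at the current state.

    reward_values[i]: estimated future reward if policy i's action is taken.
    cost_values[i]:   estimated future cost if policy i's action is taken.
    cost_budget:      cost allowance at this state (hypothetical name).
    """
    if len(reward_values) != len(cost_values) or not reward_values:
        raise ValueError("need one reward and one cost estimate per policy")

    # Step 1: keep only candidates whose estimated cost fits the budget.
    safe = [i for i, cost in enumerate(cost_values) if cost <= cost_budget]

    # Fallback (an assumption, not stated in the source): if nothing is
    # estimated to be safe, pick the candidate with the lowest cost.
    if not safe:
        return min(range(len(cost_values)), key=lambda i: cost_values[i])

    # Step 2: among the safe candidates, pick the highest estimated reward.
    return max(safe, key=lambda i: reward_values[i])


# Three candidate policies, ordered from reward-seeking to cost-averse.
rewards = [5.0, 3.0, 1.0]
costs = [10.0, 4.0, 1.0]
print(select_policy(rewards, costs, cost_budget=5.0))   # 1
print(select_policy(rewards, costs, cost_budget=20.0))  # 0
```

Because the budget is an argument to the selection step rather than something baked into training, the same set of trained policies can be reused under different constraints, which is the property the abstract emphasizes.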
