FAWAC: Feasibility Informed Advantage Weighted Regression for Persistent Safety in Offline Reinforcement Learning

Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar, Cody Fleming

TL;DR

Feasibility Informed Advantage Weighted Actor-Critic (FAWAC) is a method that prioritizes persistent safety in constrained Markov decision processes (CMDPs). The paper also proposes a strategy for a more challenging class of problems: tempting datasets, in which trajectories are predominantly high-reward but unsafe.

Abstract

Safe offline reinforcement learning aims to learn policies that maximize cumulative rewards while adhering to safety constraints, using only offline data for training. A key challenge is balancing safety and performance, particularly when the policy encounters out-of-distribution (OOD) states and actions, which can lead to safety violations or overly conservative behavior during deployment. To address these challenges, we introduce Feasibility Informed Advantage Weighted Actor-Critic (FAWAC), a method that prioritizes persistent safety in constrained Markov decision processes (CMDPs). FAWAC formulates policy optimization with feasibility conditions derived specifically for offline datasets, enabling safe policy updates in non-parametric policy space, followed by projection into parametric space for constrained actor training. By incorporating a cost-advantage term into Advantage Weighted Regression (AWR), FAWAC ensures that the safety constraints are respected while maximizing performance. Additionally, we propose a strategy to address a more challenging class of problems involving tempting datasets, where trajectories are predominantly high-reward but unsafe. Empirical evaluations on standard benchmarks demonstrate that FAWAC achieves strong results, effectively balancing safety and performance when learning policies from static datasets.
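
To make the idea of adding a cost-advantage term to AWR concrete, here is a minimal sketch in Python with NumPy. It is not the paper's exact objective: the abstract does not give the formula, so the combination rule (reward advantage minus a scaled cost advantage, exponentiated with a temperature) is an assumption, as are the function names and the weight clipping. The parameter names `nu_hat` and `lam` echo the symbols in the Figure 3 caption, but the roles assigned to them here are guesses.

```python
import numpy as np


def feasibility_informed_weights(reward_adv, cost_adv, nu_hat=1.0, lam=1.0,
                                 max_weight=100.0):
    """Hypothetical AWR-style weights that penalize costly actions.

    Assumption (not taken from the source): the combined advantage is
    reward_adv - nu_hat * cost_adv, and lam acts as a temperature.
    """
    reward_adv = np.asarray(reward_adv, dtype=np.float64)
    cost_adv = np.asarray(cost_adv, dtype=np.float64)
    combined = reward_adv - nu_hat * cost_adv
    # Clip the exponent so the weights stay finite and bounded by max_weight.
    exponent = np.minimum(combined / lam, np.log(max_weight))
    return np.exp(exponent)


def weighted_regression_loss(log_probs, weights):
    """Negative weighted log-likelihood, the usual AWR actor loss form."""
    log_probs = np.asarray(log_probs, dtype=np.float64)
    return -np.mean(weights * log_probs)


# Two dataset actions with equal reward advantage; the second one is costly.
weights = feasibility_informed_weights(reward_adv=[1.0, 1.0], cost_adv=[0.0, 2.0])
loss = weighted_regression_loss(log_probs=[-0.5, -0.5], weights=weights)
```

Under these assumptions, the costly action receives a smaller weight, so the actor is pushed less strongly toward imitating it. Setting `nu_hat=0` recovers plain reward-only advantage weighting.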

Paper Structure

This paper contains 19 sections, 3 theorems, 33 equations, 3 figures, 2 tables.

Key Result

Lemma 1

The optimization problem induced by Eqs. achiam-adv-optim-problem to achiam-adv-optim-problem-probab-constraint is a convex optimization problem.

Figures (3)

  • Figure 1: (a) Block diagram illustrating the FAWAC training with a static dataset. Transitions are sampled to train the critic with TD loss and the actor with AWR loss, where the advantage is feasibility-informed, incorporating both cost-advantage and reward standard deviation components. (b) Diagram illustrating the distribution of trajectories in the cost-return vs. reward-return space. The goal in safe offline RL is to learn a policy that generates high-performing, safe trajectories.
  • Figure 2: Training curves depicting reward and cost returns for our method and baselines.
  • Figure 3: Ablation on the FAWAC hyperparameters $\hat{\nu}$ and $\lambda$.

Theorems & Definitions (6)

  • Lemma 1
  • Theorem 1
  • Proposition 1
  • Proof
  • Proof
  • Proof