Offline Safe Reinforcement Learning Using Trajectory Classification

Ze Gong, Akshat Kumar, Pradeep Varakantham

TL;DR

This paper proposes to learn a policy that generates desirable trajectories and avoids undesirable trajectories, where (un)desirability scores are provided by a classifier learnt from the dataset of desirable and undesirable trajectories.

Abstract

Offline safe reinforcement learning (RL) has emerged as a promising approach for learning safe behaviors without engaging in risky online interactions with the environment. Most existing methods in offline safe RL rely on cost constraints at each time step (derived from global cost constraints), which can result in either overly conservative policies or violation of safety constraints. In this paper, we propose to learn a policy that generates desirable trajectories and avoids undesirable trajectories. Specifically, we first partition the pre-collected dataset of state-action trajectories into desirable and undesirable subsets. Intuitively, the desirable set contains high-reward, safe trajectories, and the undesirable set contains unsafe trajectories and low-reward safe trajectories. Second, we learn a policy that generates desirable trajectories and avoids undesirable trajectories, where (un)desirability scores are provided by a classifier learnt from the dataset of desirable and undesirable trajectories. This approach bypasses the computational complexity and stability issues of the min-max objective employed in existing methods. Theoretically, we also show our approach's strong connections to existing learning paradigms involving human feedback. Finally, we extensively evaluate our method using the DSRL benchmark for offline safe RL. Empirically, our method outperforms competitive baselines, achieving higher rewards and better constraint satisfaction across a wide variety of benchmark tasks.
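
The partitioning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Trajectory` class, the `partition` function, and the `reward_quantile` parameter are hypothetical names introduced here, and the rule for separating high-reward from low-reward safe trajectories (a quantile cutoff over safe trajectories) is an assumption, since the abstract does not specify it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    """Hypothetical container holding only trajectory-level returns."""
    total_reward: float  # sum of rewards over the trajectory
    total_cost: float    # sum of costs over the trajectory


def partition(
    trajectories: List[Trajectory],
    cost_threshold: float,
    reward_quantile: float = 0.5,  # assumed cutoff rule, not from the paper
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Split a dataset into (desirable, undesirable) subsets.

    Desirable: safe trajectories whose reward is at or above the cutoff.
    Undesirable: all unsafe trajectories plus low-reward safe ones.
    """
    safe = [t for t in trajectories if t.total_cost <= cost_threshold]
    unsafe = [t for t in trajectories if t.total_cost > cost_threshold]
    if not safe:
        return [], unsafe

    # Reward cutoff taken at the requested quantile of the safe trajectories.
    rewards = sorted(t.total_reward for t in safe)
    index = min(int(reward_quantile * len(rewards)), len(rewards) - 1)
    cutoff = rewards[index]

    desirable = [t for t in safe if t.total_reward >= cutoff]
    undesirable = unsafe + [t for t in safe if t.total_reward < cutoff]
    return desirable, undesirable
```

The two returned subsets would then serve as the labelled examples for a desirability classifier; how that classifier is trained and how it enters the policy objective is outside the scope of this sketch.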

Paper Structure

This paper contains 29 sections, 14 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Visualization of trajectories in the reward-cost return space with a cost threshold of $40$. The x-axis is the total cost and the y-axis is the total reward. Each blue dot in the figure corresponds to a trajectory in the offline dataset. Based on the cost threshold, trajectories are categorized into safe (in the green area) and unsafe (in the pink area).
  • Figure 2: Visualization of normalized reward and cost for each task. The dotted blue vertical lines represent the cost threshold of 1. Each round dot represents a task, with green indicating safety and red indicating constraint violation.
  • Figure 3: Ratio of tasks solved regarding safety.
  • Figure 4: Results of normalized reward and normalized cost with various $x\%$ in two tasks.
  • Figure 5: Results of normalized reward and normalized cost with various $y\%$ in two tasks.
  • ...and 12 more figures