On the Design of Safe Continual RL Methods for Control of Nonlinear Systems

Austin Coursey, Marcos Quinones-Grueiro, Gautam Biswas

TL;DR

The paper tackles safe continual reinforcement learning by examining how safety guarantees interact with continual adaptation under abrupt non-stationarities. It empirically compares constrained policy optimization (CPO), proximal policy optimization with elastic weight consolidation (PPO+EWC), and a reward-shaped Safe EWC on velocity-constrained MuJoCo HalfCheetah and Ant tasks with sudden limb removals to simulate faults. CPO preserves safety but suffers catastrophic forgetting, while PPO+EWC forgets less yet often neglects safety; Safe EWC, achieved by penalizing safety violations through reward shaping, offers a practical balance with reduced forgetting and competitive safety. The findings suggest that simple safe-continual extensions can be effective and motivate richer mechanisms to integrate safety with continual learning for real-world, non-stationary control systems.
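To make the reward-shaping idea concrete, here is a minimal sketch of how a safety cost could be folded into the reward before it reaches PPO+EWC. The summary above does not give the exact shaping term, so the linear penalty, the indicator-style velocity cost, and the names `velocity_cost`, `shaped_reward`, and `penalty_coef` are all illustrative assumptions rather than the authors' formulation.

```python
def velocity_cost(velocity: float, velocity_limit: float) -> float:
    """Assumed cost signal: 1.0 when the speed exceeds the limit, else 0.0."""
    return 1.0 if abs(velocity) > velocity_limit else 0.0


def shaped_reward(reward: float, cost: float, penalty_coef: float = 1.0) -> float:
    """Subtract a scaled safety cost from the task reward.

    penalty_coef is a hypothetical hyperparameter that trades off task
    performance against constraint violations.
    """
    return reward - penalty_coef * cost


# One step in which the agent moves faster than the (assumed) limit.
step_reward = shaped_reward(reward=2.0,
                            cost=velocity_cost(velocity=3.0, velocity_limit=2.5),
                            penalty_coef=0.5)
```

Because the penalty is part of the reward, whatever the continual learner consolidates for the task also carries information about safety, which is the intuition behind Safe EWC as described here.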

Abstract

Reinforcement learning (RL) algorithms have been successfully applied to control tasks associated with unmanned aerial vehicles and robotics. In recent years, safe RL has been proposed to allow the safe execution of RL algorithms in industrial and mission-critical systems that operate in closed loops. However, if the system operating conditions change, such as when an unknown fault occurs in the system, typical safe RL algorithms are unable to adapt while retaining past knowledge. Continual reinforcement learning algorithms have been proposed to address this issue. However, the impact of continual adaptation on the system's safety is an understudied problem. In this paper, we study the intersection of safe and continual RL. First, we empirically demonstrate that a popular continual RL algorithm, online elastic weight consolidation, is unable to satisfy safety constraints in non-linear systems subject to varying operating conditions. Specifically, we study the MuJoCo HalfCheetah and Ant environments with velocity constraints and sudden joint loss non-stationarity. Then, we show that an agent trained using constrained policy optimization, a safe RL algorithm, experiences catastrophic forgetting in continual learning settings. With this in mind, we explore a simple reward-shaping method to ensure that elastic weight consolidation prioritizes remembering both safety and task performance for safety-constrained, non-linear, and non-stationary dynamical systems.
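For readers unfamiliar with online elastic weight consolidation, the sketch below shows the general shape of the regularizer: a running diagonal Fisher estimate weights a quadratic penalty that pulls parameters toward those learned on earlier tasks. This is a generic, dependency-free illustration, not the paper's implementation; the class name `OnlineEWC`, the hyperparameters `ewc_lambda` and `decay`, and the use of squared per-sample gradients as the Fisher estimate are assumptions.

```python
from typing import List, Sequence


class OnlineEWC:
    """Generic online EWC regularizer over a flat parameter vector (sketch)."""

    def __init__(self, n_params: int, ewc_lambda: float = 1.0, decay: float = 0.9) -> None:
        self.fisher: List[float] = [0.0] * n_params   # running diagonal Fisher
        self.anchor: List[float] = [0.0] * n_params   # parameters after last task
        self.ewc_lambda = ewc_lambda                  # penalty strength (assumed name)
        self.decay = decay                            # down-weights older tasks

    def consolidate(self, params: Sequence[float],
                    per_sample_grads: Sequence[Sequence[float]]) -> None:
        """Call at a task boundary: update the Fisher estimate and the anchor."""
        n = len(per_sample_grads)
        new_fisher = [sum(g[i] ** 2 for g in per_sample_grads) / n
                      for i in range(len(params))]
        self.fisher = [self.decay * f + nf for f, nf in zip(self.fisher, new_fisher)]
        self.anchor = list(params)

    def penalty(self, params: Sequence[float]) -> float:
        """Quadratic penalty: (lambda / 2) * sum_i F_i * (theta_i - anchor_i)^2."""
        return 0.5 * self.ewc_lambda * sum(
            f * (p - a) ** 2 for f, p, a in zip(self.fisher, params, self.anchor))
```

In a PPO+EWC setup, one would typically add `penalty(current_params)` to the policy loss, so that parameters with high Fisher values change little when the dynamics shift.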

Paper Structure

This paper contains 10 sections, 6 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Task sequence for safe continual reinforcement learning. The top sequence is the MuJoCo HalfCheetah; the bottom is the Ant. Task changes occur every 1 million training timesteps, and the cycle repeats. The tasks are designed to replicate a challenging and drastic change in operating mode caused by equipment being repaired or suddenly breaking due to physical damage or a fault. The objective in both environments is to travel as far as possible in a fixed amount of time while keeping the velocity within its constraint (visualized by the green bubble).
  • Figure 2: Rewards and costs during training with task changes for the HalfCheetah environment. The tasks, shown by the background color, correspond to the tasks shown in Fig. 1.
  • Figure 3: Immediate reward when experiencing nominal dynamics for the HalfCheetah. This measures how well the policy under nominal conditions is remembered.
  • Figure 4: Rewards and costs during training with task changes for the Ant environment. The tasks, shown by the background color, correspond to the tasks shown in Fig. 1.
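
The cyclic task schedule described in the Figure 1 caption can be sketched as a simple lookup from training timestep to active task. Only the 1 million timestep interval and the repetition of the cycle come from the caption; the task labels and the number of tasks in `EXAMPLE_TASKS` are placeholders, since the figure itself is not reproduced here.

```python
from typing import Sequence

STEPS_PER_TASK = 1_000_000  # task change interval stated in the Figure 1 caption

# Placeholder labels: the real sequence of faults is shown only in the figure.
EXAMPLE_TASKS = ["nominal", "fault_a", "fault_b"]


def active_task(timestep: int, tasks: Sequence[str],
                steps_per_task: int = STEPS_PER_TASK) -> str:
    """Return the task active at a training timestep, cycling through `tasks`."""
    if timestep < 0:
        raise ValueError("timestep must be non-negative")
    return tasks[(timestep // steps_per_task) % len(tasks)]
```

With three placeholder tasks, timestep 3,000,000 wraps around to the first task again, matching the repeating cycle the caption describes.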