Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent Approach

Rory Young, Nicolas Pugeault

TL;DR

The paper proposes an improvement to the Dreamer V3 architecture that adds Maximal Lyapunov Exponent regularisation. This reduces the chaotic state dynamics, making the learnt policies more resilient to sensor noise and adversarial attacks and improving the suitability of deep reinforcement learning for real-world applications.

Abstract

Deep reinforcement learning agents achieve state-of-the-art performance in a wide range of simulated control tasks. However, successful applications to real-world problems remain limited. One reason for this gap is that the learnt policies are not robust to observation noise or adversarial attacks. In this paper, we investigate the robustness of deep RL policies to a single small state perturbation in deterministic continuous control tasks. We demonstrate that RL policies can be deterministically chaotic, as small perturbations to the system state have a large impact on subsequent state and reward trajectories. This unstable non-linear behaviour has two consequences: first, inaccuracies in sensor readings, or adversarial attacks, can cause significant performance degradation; second, even policies that show robust performance in terms of rewards may have unpredictable behaviour in practice. These two facets of chaos in RL policies drastically restrict the application of deep RL to real-world problems. To address this issue, we propose an improvement on the successful Dreamer V3 architecture, implementing Maximal Lyapunov Exponent regularisation. This new approach reduces the chaotic state dynamics, rendering the learnt policies more resilient to sensor noise or adversarial attacks and thereby improving the suitability of deep reinforcement learning for real-world applications.
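To make the notion of deterministic chaos in the abstract concrete, the following is a minimal sketch of how a maximal Lyapunov exponent can be estimated by tracking the divergence of two nearby trajectories with renormalisation at every step. It is not the estimator used in the paper: the function `estimate_mle`, its parameters, and the one-dimensional toy maps are assumptions chosen for illustration, with `step` standing in for one closed-loop transition of a deterministic environment under a deterministic policy.

```python
import math


def estimate_mle(step, x0, eps=1e-8, n_steps=5000, burn_in=100):
    """Estimate the maximal Lyapunov exponent of a 1-D deterministic map.

    `step` is a hypothetical placeholder for one closed-loop transition
    (environment plus deterministic policy); here it maps a float to a float.
    """
    x = x0
    for _ in range(burn_in):      # discard the initial transient
        x = step(x)

    y = x + eps                   # perturbed copy, initially eps away
    log_growth = 0.0
    counted = 0
    for _ in range(n_steps):
        x, y = step(x), step(y)
        d = abs(y - x)
        if d > 0.0:
            # Per-step growth rate of the separation, in nats.
            log_growth += math.log(d / eps)
            counted += 1
            direction = 1.0 if y > x else -1.0
        else:
            # Trajectories collapsed to the same float; skip this step.
            direction = 1.0
        # Renormalise so the separation is eps again before the next step.
        y = x + direction * eps
    return log_growth / max(counted, 1)


# Toy systems (assumptions, not tasks from the paper).
chaotic_mle = estimate_mle(lambda x: 4.0 * x * (1.0 - x), x0=0.2)
stable_mle = estimate_mle(lambda x: 0.5 * x, x0=0.2)
print(f"logistic map: {chaotic_mle:.3f}, contracting map: {stable_mle:.3f}")
```

A positive estimate means nearby states separate exponentially (chaotic), while a negative one means perturbations die out. For the logistic map at this parameter the estimate should come out near $\ln 2$, and for the contracting map near $\ln 0.5$. A continuous control task has a vector-valued state, so the separation would be measured with a norm rather than an absolute value.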

Paper Structure

This paper contains 17 sections, 6 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Reward attained when a trained deterministic Soft Actor-Critic (Haarnoja et al., 2018) agent controls the deterministic Walker Walk environment. Each system has the same initial configuration other than the torso angle, which is perturbed by $\pm5\times10^{-4}$ degrees. This small perturbation causes the systems to significantly diverge after 50 steps due to the chaotic nature of the control interaction. Consequently, this affects overall performance as there is significant variation in the total reward attained.
  • Figure 2: Total episode reward for the Pointmass Easy (PM), Cartpole Balance (CB), Cartpole Swingup (CS), Walker Stand (WS), Walker Walk (WW), Walker Run (WR) and Cheetah Run (CR) environments when controlled by trained instances of SAC, TD3, Dreamer V3 (DR3) and an agent which takes no actions (None). Each policy-environment combination is independently trained with three random seeds, and the interquartile mean episode reward with a bootstrapped 95% confidence interval is reported over 80 evaluation episodes, each with a fixed length of 1000 steps.
  • Figure 3: Estimated Maximal Lyapunov Exponent (MLE) and Sum of Lyapunov Exponents (SLE) for the Pointmass (PM), Cartpole Balance (CB), Cartpole Swingup (CS), Walker Stand (WS), Walker Walk (WW), Walker Run (WR) and Cheetah Run (CR) environments when controlled by a trained instance of SAC, TD3, Dreamer V3 (DR3) and an agent which takes no actions (None). Each policy-environment combination is independently trained with three random seeds and the interquartile mean MLE and SLE for each seed is calculated using 20 initial states. A bootstrapped 95% confidence interval is included to show the variation in MLE and SLE across random seeds.
  • Figure 4: Partial state trajectory produced by Dreamer V3 when controlling Cartpole Balance and Walker Walk subject to a single initial state perturbation. Initially, each system is separated by only $10^{-4}$ units but the subsequent state trajectories diverge significantly as the control interaction is chaotic.
  • Figure 5: Reward MLE interquartile mean for the Pointmass (PM), Cartpole Balance (CB), Cartpole Swingup (CS), Walker Stand (WS), Walker Walk (WW), Walker Run (WR) and Cheetah Run (CR) when controlled by SAC, TD3 and Dreamer V3 (DR3). Each policy-environment combination is independently trained with three random seeds and the reward MLE for each seed is calculated using 20 initial states. A bootstrapped 95% confidence interval is included to show the variation in reward stability across random seeds.
  • ...and 5 more figures
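
Several of the captions above report an interquartile mean with a bootstrapped 95% confidence interval. The sketch below shows one common way to compute both, assuming a simple trimmed form of the interquartile mean and a percentile bootstrap. The paper's exact aggregation across seeds and episodes is not specified here, so the function names, the resample count, and the sample values are illustrative assumptions rather than the authors' procedure.

```python
import random


def interquartile_mean(values):
    """Mean of the middle 50% of the sorted values (simple trimmed form)."""
    ordered = sorted(values)
    k = len(ordered) // 4                     # items dropped from each tail
    middle = ordered[k:len(ordered) - k]
    return sum(middle) / len(middle)


def bootstrap_ci(values, stat=interquartile_mean, n_resamples=2000,
                 level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for `stat` over `values`."""
    rng = random.Random(seed)                 # seeded for repeatability
    n = len(values)
    stats = sorted(
        stat([rng.choice(values) for _ in range(n)])   # resample with replacement
        for _ in range(n_resamples)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


# Made-up placeholder episode rewards, not results from the paper.
rewards = [912.0, 955.5, 430.2, 960.1, 948.7, 951.3, 120.9, 958.8]
iqm = interquartile_mean(rewards)
lo, hi = bootstrap_ci(rewards)
print(f"IQM = {iqm:.1f}, 95% CI = [{lo:.1f}, {hi:.1f}]")
```

The interquartile mean discards the top and bottom quarter of the values, so a few collapsed or unusually lucky episodes do not dominate the reported score, which matters when small perturbations can produce large swings in total reward.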