Deep Reinforcement Learning Behavioral Mode Switching Using Optimal Control Based on a Latent Space Objective

Sindre Benjamin Remman, Bjørn Andreas Kristiansen, Anastasios M. Lekkas

TL;DR

This work addresses the interpretability and controllability of deep reinforcement learning (DRL) policies by revealing behavioral modes in the policy's latent space with PaCMAP and actively steering the agent between modes with optimal control on a latent-space objective. The authors formulate a nonlinear program (NLP) that minimizes the distance between a latent-space target and the policy's latent projection at the end of the horizon, solve it with IPOPT through CasADi, and apply it to a deterministic LunarLander-v2 setting. Key contributions include identifying behavioral modes in policy latent spaces, proposing a latent-space objective for mode switching, and demonstrating substantial reward improvements when switching from a failed to a successful mode, as well as the reverse switch from successful to failed. The approach offers a practical way to influence and understand DRL policies without requiring explicit full-environment models, with potential impact on interpretability and controllability in safety-critical domains, provided it extends to more complex environments.
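
One way to write down the latent-space objective summarized here is the sketch below. The symbols are assumptions introduced for illustration and are not taken from the paper: $x_k$ is the environment state, $u_k$ the action, $f$ a dynamics model, $h$ the map from a state to the policy's latent representation, $z^{\ast}$ the latent target, $N$ the horizon, and $\mathcal{U}$ the admissible action set. The paper's own formulation may use a different distance, different constraints, or additional terms.

```latex
\begin{aligned}
\min_{u_0,\dots,u_{N-1}} \quad & \left\lVert z^{\ast} - h(x_N) \right\rVert_2^2 \\
\text{s.t.} \quad & x_{k+1} = f(x_k, u_k), \qquad k = 0,\dots,N-1, \\
& x_0 = \bar{x}, \\
& u_k \in \mathcal{U}, \qquad k = 0,\dots,N-1.
\end{aligned}
```

Only the terminal state enters the cost, so the intermediate states are free to take whatever path brings the latent projection closest to the target.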

Abstract

In this work, we use optimal control to change the behavior of a deep reinforcement learning policy by optimizing directly in the policy's latent space. We hypothesize that distinct behavioral patterns, termed behavioral modes, can be identified within certain regions of a deep reinforcement learning policy's latent space, meaning that specific actions or strategies are preferred within these regions. We identify these behavioral modes using latent space dimension-reduction with PaCMAP. Using the actions generated by the optimal control procedure, we move the system from one behavioral mode to another. We subsequently utilize these actions as a filter for interpreting the neural network policy. The results show that this approach can impose desired behavioral modes in the policy, demonstrated by showing how a failed episode can be made successful and vice versa using the lunar lander reinforcement learning environment.
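
The idea of choosing actions so that the system ends up at a target location in latent space can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the double-integrator `step` function stands in for the lunar lander dynamics, the single tanh layer in `encode` stands in for the policy's hidden layers, and projected gradient descent with finite differences replaces the IPOPT/CasADi solver mentioned in the TL;DR. It is not the authors' implementation.

```python
import math

DT = 0.1       # integration step of the toy model (assumed)
HORIZON = 20   # number of control steps (assumed)
U_MAX = 1.0    # symmetric action bound (assumed)

# Fixed weights of a toy "hidden layer"; a trained policy network
# would supply its own weights here.
W = [[0.8, -0.3], [0.2, 0.9]]
B = [0.1, -0.2]


def step(state, u):
    """Toy double-integrator dynamics standing in for the environment."""
    pos, vel = state
    return (pos + DT * vel, vel + DT * u)


def encode(state):
    """Toy latent projection: one tanh layer applied to the state."""
    return [math.tanh(W[i][0] * state[0] + W[i][1] * state[1] + B[i])
            for i in range(2)]


def rollout(x0, actions):
    """Apply the action sequence and return the final state."""
    state = x0
    for u in actions:
        state = step(state, u)
    return state


def latent_cost(x0, actions, z_target):
    """Squared distance between the target and the final latent point."""
    z = encode(rollout(x0, actions))
    return sum((zt - zi) ** 2 for zt, zi in zip(z_target, z))


def optimize_actions(x0, z_target, iters=500, lr=0.5, eps=1e-5):
    """Projected gradient descent with forward finite differences."""
    actions = [0.0] * HORIZON
    for _ in range(iters):
        base = latent_cost(x0, actions, z_target)
        grad = []
        for k in range(HORIZON):
            bumped = list(actions)
            bumped[k] += eps
            grad.append((latent_cost(x0, bumped, z_target) - base) / eps)
        # Gradient step, then clip each action back into its bounds.
        actions = [min(U_MAX, max(-U_MAX, u - lr * g))
                   for u, g in zip(actions, grad)]
    return actions


x0 = (0.0, 0.0)
# Build a reachable latent target by encoding the end state of a
# known feasible action sequence.
z_target = encode(rollout(x0, [0.5] * HORIZON))
initial_cost = latent_cost(x0, [0.0] * HORIZON, z_target)
actions = optimize_actions(x0, z_target)
final_cost = latent_cost(x0, actions, z_target)
print(f"latent cost: {initial_cost:.4f} -> {final_cost:.6f}")
```

The target is built from a feasible action sequence so that a solution is known to exist. With a real policy, the target would instead be a latent point taken from the desired behavioral mode, and a dedicated NLP solver would handle the constraints more robustly than this simple descent loop.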

Paper Structure

This paper contains 12 sections, 4 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Policy latent space illustration.
  • Figure 2: Case Study 1: PaCMAP low-dimensional embedding showing latent space initial and goal location.
  • Figure 3: Case Study 1.
  • Figure 4: Case Study 1: Actions when chosen by the policy during the failed episode.
  • Figure 5: Case Study 1: Actions when switching behavior.
  • ...and 6 more figures