Deep Reinforcement Learning Behavioral Mode Switching Using Optimal Control Based on a Latent Space Objective
Sindre Benjamin Remman, Bjørn Andreas Kristiansen, Anastasios M. Lekkas
TL;DR
This work addresses the interpretability and controllability of DRL policies by revealing behavioral modes in the policy's latent space with PaCMAP and then steering the agent between modes using optimal control with a latent-space objective. The authors formulate an NLP that minimizes the distance between a latent-space target and the policy's latent projection at the end of the horizon, solve it with IPOPT through CasADi, and apply it to a deterministic LunarLander-v2 setting. Key contributions include identifying behavioral modes in policy latent spaces, proposing a latent-space objective for mode switching, and demonstrating substantial reward improvements when switching from a failed mode to a successful one, as well as the reverse switch from success to failure. The approach provides a practical pathway for influencing and understanding DRL policies without requiring explicit full-environment models, with potential impact on policy interpretability and controllability in safety-critical domains, provided it can be extended to more complex environments.
Abstract
In this work, we use optimal control to change the behavior of a deep reinforcement learning policy by optimizing directly in the policy's latent space. We hypothesize that distinct behavioral patterns, termed behavioral modes, can be identified within certain regions of a deep reinforcement learning policy's latent space, meaning that specific actions or strategies are preferred within these regions. We identify these behavioral modes using latent space dimension-reduction with Pairwise Controlled Manifold Approximation Projection (PaCMAP). Using the actions generated by the optimal control procedure, we move the system from one behavioral mode to another. We subsequently utilize these actions as a filter for interpreting the neural network policy. The results show that this approach can impose desired behavioral modes in the policy, demonstrated by showing how a failed episode can be made successful and vice versa using the lunar lander reinforcement learning environment.
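The latent-space objective described above can be sketched as a nonlinear program. The notation here is an assumption made for illustration and is not taken from the paper: $x_k$ is the system state, $u_k$ the action at step $k$, $N$ the horizon length, $f$ whatever prediction model of the system is available to the optimizer, $\phi$ the map from a state to the policy's reduced latent representation, $z^{\ast}$ a target point inside the desired behavioral mode, and $\mathcal{U}$ the set of admissible actions.

```latex
\begin{aligned}
\min_{u_0,\dots,u_{N-1}} \quad & \left\lVert \phi(x_N) - z^{\ast} \right\rVert_2^2 \\
\text{s.t.} \quad & x_{k+1} = f(x_k, u_k), \qquad k = 0,\dots,N-1, \\
& x_0 = \bar{x}, \\
& u_k \in \mathcal{U}, \qquad k = 0,\dots,N-1.
\end{aligned}
```

Under these assumptions, only the terminal latent projection $\phi(x_N)$ enters the cost, so the optimizer is free to choose any admissible action sequence that ends in the target region of the latent space.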
