Action Mapping for Reinforcement Learning in Continuous Environments with Constraints
Mirco Theile, Lukas Dirnberger, Raphael Trumpp, Marco Caccamo, Alberto L. Sangiovanni-Vincentelli
TL;DR
The paper tackles sample efficiency and constraint handling in deep reinforcement learning for continuous action spaces. It introduces action mapping (AM), which learns a feasibility policy $\pi_f$ that maps a latent variable $z$ into the state-dependent set of feasible actions, and composes it with an objective policy $\pi_o$ that selects $z$, so that $\pi = \pi_f \circ \pi_o$. This effectively turns a state-wise constrained MDP into an unconstrained MDP. AM is instantiated with SAC and PPO (AM-SAC, AM-PPO) and evaluated on two tasks: a robotic arm pose task with a perfect feasibility model and a spline-based path planning task with an approximate feasibility model. In both, it outperforms action replacement, resampling, projection, and Lagrangian baselines in learning speed and constraint adherence. The work demonstrates that imperfect feasibility models can provide valuable inductive bias and that the latent representation enables multi-modal action distributions, with practical implications for constrained robotics and autonomous planning.
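To make the composition $\pi = \pi_f \circ \pi_o$ concrete, here is a minimal Python sketch on a toy one-dimensional problem. Everything in it is an illustrative assumption rather than the paper's implementation: the feasibility model `feasible`, the obstacle-style constraint, and the hand-written `feasibility_policy` stand in for components that the paper learns, and `objective_policy` is a random placeholder for a trained SAC or PPO actor. The point is only that the objective policy acts in an unconstrained latent space while the feasibility mapping guarantees that the executed action lies in the (here disconnected) feasible set.

```python
import random

EPS = 1e-9  # tolerance for floating-point comparisons at the set boundary


def feasible(state: float, action: float) -> bool:
    """Hypothetical feasibility model: the action must stay in [-2, 2]
    and keep a distance of at least 0.5 from an obstacle located at `state`."""
    in_bounds = -2.0 - EPS <= action <= 2.0 + EPS
    clear_of_obstacle = abs(action - state) >= 0.5 - EPS
    return in_bounds and clear_of_obstacle


def feasibility_policy(state: float, z: float) -> float:
    """Hand-written stand-in for pi_f: maps a latent z in [-1, 1] onto the
    feasible set [-2, state - 0.5] U [state + 0.5, 2] (assumes state in [-1, 1])."""
    left_len = (state - 0.5) - (-2.0)   # length of the left feasible interval
    right_len = 2.0 - (state + 0.5)     # length of the right feasible interval
    u = (z + 1.0) / 2.0 * (left_len + right_len)  # position along both intervals
    if u <= left_len:
        return -2.0 + u
    return state + 0.5 + (u - left_len)


def objective_policy(state: float, rng: random.Random) -> float:
    """Placeholder for pi_o: a trained actor would output z; here z is random."""
    return rng.uniform(-1.0, 1.0)


def policy(state: float, rng: random.Random) -> float:
    """Composed policy pi = pi_f o pi_o: choose a latent, then map it to an action."""
    z = objective_policy(state, rng)
    return feasibility_policy(state, z)


if __name__ == "__main__":
    rng = random.Random(0)
    state = 0.3
    actions = [policy(state, rng) for _ in range(5)]
    print(all(feasible(state, a) for a in actions))
```

Because every latent value maps to a feasible action, the objective policy never needs to learn the constraint itself. Nearby latents on either side of the split point map to actions on opposite sides of the obstacle, which hints at how a simple distribution over $z$ can induce a multi-modal distribution over actions.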
Abstract
Deep reinforcement learning (DRL) has had success across various domains, but applying it to environments with constraints remains challenging due to poor sample efficiency and slow convergence. Recent literature explored incorporating model knowledge to mitigate these problems, particularly through the use of models that assess the feasibility of proposed actions. However, integrating feasibility models efficiently into DRL pipelines in environments with continuous action spaces is non-trivial. We propose a novel DRL training strategy utilizing action mapping that leverages feasibility models to streamline the learning process. By decoupling the learning of feasible actions from policy optimization, action mapping allows DRL agents to focus on selecting the optimal action from a reduced feasible action set. We demonstrate through experiments that action mapping significantly improves training performance in constrained environments with continuous action spaces, especially with imperfect feasibility models.
