Model-based Reinforcement Learning for Parameterized Action Spaces
Renhao Zhang, Haotian Fu, Yilin Miao, George Konidaris
TL;DR
This work tackles reinforcement learning in Parameterized Action MDPs (PAMDPs), where each action is a discrete choice paired with continuous parameters, by introducing DLPA, a model-based framework that learns a dynamics model conditioned on parameterized actions and plans with a PAMDP-aware MPPI planner. DLPA comprises three key components: multiple transition-inference architectures, an H-step trajectory loss for training the dynamics model, and separate reward predictors conditioned on termination. A Lipschitz-based theoretical analysis bounds the planning regret under model error: assuming Lipschitz continuity and bounded model errors, the value of DLPA's rollout stays close to that of the optimal trajectory, with bounds that depend on discrete-action misclassification, differences in the continuous-parameter distributions, and prediction errors. Empirically, DLPA delivers superior sample efficiency and competitive asymptotic performance across eight PAMDP benchmarks, including ones with very large action spaces, significantly outperforming state-of-the-art model-free baselines and demonstrating robustness to planning with imperfect models. Overall, the work establishes model-based RL as a viable and effective approach for complex PAMDPs and provides practical planning and learning mechanisms tailored to their structured action spaces.
Abstract
We propose a novel model-based reinforcement learning algorithm, Dynamics Learning and predictive control with Parameterized Actions (DLPA), for Parameterized Action Markov Decision Processes (PAMDPs). The agent learns a parameterized-action-conditioned dynamics model and plans with a modified version of Model Predictive Path Integral (MPPI) control. We theoretically quantify the difference between the generated trajectory and the optimal trajectory during planning, in terms of the value they achieve, through the lens of Lipschitz continuity. Our empirical results on several standard benchmarks show that our algorithm achieves better sample efficiency and asymptotic performance than state-of-the-art PAMDP methods.
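To make the planning idea concrete, the following is a minimal sketch of an MPPI-style sampling planner over parameterized actions. It is not the authors' implementation. It assumes a categorical distribution over the discrete actions and a per-action Gaussian over the continuous parameter at each planning step, with both refit from exponentially weighted elite samples. The names toy_model, plan, and GOAL, the 1-D toy task, and all hyperparameters are hypothetical; toy_model stands in for a learned dynamics and reward model.

```python
import numpy as np

GOAL = 3.0  # hypothetical target position for the toy task


def toy_model(state, k, z):
    """Hypothetical stand-in for a learned dynamics/reward model.

    state: (N,) positions; k: (N,) discrete action ids (0 = left, 1 = right);
    z: (N,) continuous step sizes in [0, 1].
    """
    direction = np.where(k == 1, 1.0, -1.0)
    next_state = state + direction * z
    reward = -np.abs(next_state - GOAL)
    return next_state, reward


def plan(state, model, num_actions=2, horizon=5, num_samples=256,
         num_elites=32, iterations=6, temperature=1.0, seed=0):
    """Return (discrete action, continuous parameter) for the first step."""
    rng = np.random.default_rng(seed)
    # One categorical over discrete actions and one Gaussian per action, per step.
    probs = np.full((horizon, num_actions), 1.0 / num_actions)
    mean = np.full((horizon, num_actions), 0.5)
    std = np.full((horizon, num_actions), 0.5)

    for _ in range(iterations):
        # Sample discrete actions, shape (N, H).
        ks = np.stack(
            [rng.choice(num_actions, size=num_samples, p=probs[t])
             for t in range(horizon)], axis=1)
        # Sample a parameter for every action, then keep the one that matches
        # the sampled discrete action, shape (N, H).
        zs_all = np.clip(
            rng.normal(mean, std, size=(num_samples, horizon, num_actions)),
            0.0, 1.0)
        zs = np.take_along_axis(zs_all, ks[..., None], axis=2)[..., 0]

        # Roll every sampled action sequence through the model.
        s = np.full(num_samples, state, dtype=float)
        returns = np.zeros(num_samples)
        for t in range(horizon):
            s, r = model(s, ks[:, t], zs[:, t])
            returns += r

        # Exponentially weight the elite trajectories by return.
        elite = np.argsort(returns)[-num_elites:]
        w = np.exp(temperature * (returns[elite] - returns[elite].max()))
        w /= w.sum()

        # Refit both distributions from the weighted elites.
        for t in range(horizon):
            for k in range(num_actions):
                mask = ks[elite, t] == k
                wk = w[mask].sum()
                probs[t, k] = wk
                if wk > 0:
                    zk = zs[elite, t][mask]
                    m = np.sum(w[mask] * zk) / wk
                    var = np.sum(w[mask] * (zk - m) ** 2) / wk
                    mean[t, k] = m
                    std[t, k] = max(np.sqrt(var), 0.05)
            # Mix with a uniform distribution so no action's probability hits zero.
            probs[t] = 0.9 * probs[t] + 0.1 / num_actions
            probs[t] /= probs[t].sum()

    k0 = int(np.argmax(probs[0]))
    return k0, float(mean[0, k0])


if __name__ == "__main__":
    action, param = plan(0.0, toy_model)
    print("first action:", action, "parameter:", param)
```

In a receding-horizon loop, only the first planned action would be executed before replanning from the next observed state. Sampling a parameter for every discrete action and then selecting the matching one keeps the per-action Gaussians separate, which is one simple way to respect the structure of a parameterized action space.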
