Model-based Reinforcement Learning for Parameterized Action Spaces

Renhao Zhang; Haotian Fu; Yilin Miao; George Konidaris

TL;DR

This work tackles reinforcement learning in Parameterized Action MDPs (PAMDPs), where each action consists of a discrete action type and its continuous parameters, by introducing DLPA, a model-based framework that learns a parameterized-action-conditioned dynamics model and plans with a PAMDP-aware MPPI planner. DLPA makes three key design choices: multiple transition-inference architectures, an H-step trajectory loss for training the dynamics model, and separate reward predictors conditioned on termination. A Lipschitz-based theoretical analysis bounds the planning regret in terms of model error: under Lipschitz continuity and bounded model errors, DLPA's rollout remains close to the optimal trajectory, with bounds that depend on discrete-action misclassification, differences between continuous-parameter distributions, and prediction errors. Empirically, DLPA achieves better sample efficiency and competitive asymptotic performance across eight PAMDP benchmarks, including ones with very large action spaces, significantly outperforming state-of-the-art model-free baselines and remaining robust when planning with imperfect models. Overall, the work establishes model-based RL as a viable and effective approach for complex PAMDPs and provides practical planning and learning mechanisms tailored to their structured action spaces.
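
The H-step trajectory loss mentioned above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the 1-D toy dynamics, the squared-error terms, the per-step weight `rho`, and the names `h_step_loss` and `make_toy_model`. The point it tries to show is that the model is rolled forward on its own predictions rather than on recorded states, so errors that compound over the horizon are penalized.

```python
from typing import Callable, List, Tuple

# A parameterized action: (discrete action index k, continuous parameter z).
Action = Tuple[int, float]
# A model maps (state, action) -> (predicted next state, predicted reward).
Model = Callable[[float, Action], Tuple[float, float]]


def h_step_loss(model: Model,
                s0: float,
                actions: List[Action],
                true_next_states: List[float],
                true_rewards: List[float],
                rho: float = 0.9) -> float:
    """Roll the model forward H = len(actions) steps from s0, feeding each
    predicted state back in as the next input, and sum the weighted squared
    errors against the recorded trajectory. rho (hypothetical) down-weights
    later steps."""
    state = s0
    loss = 0.0
    for t, action in enumerate(actions):
        pred_state, pred_reward = model(state, action)
        err = ((pred_state - true_next_states[t]) ** 2
               + (pred_reward - true_rewards[t]) ** 2)
        loss += (rho ** t) * err
        state = pred_state  # continue from the prediction, not the truth
    return loss


def make_toy_model(gain: float) -> Model:
    """Toy dynamics: action type 0 moves right by gain*z, type 1 moves left."""
    def model(s: float, a: Action) -> Tuple[float, float]:
        k, z = a
        direction = 1.0 if k == 0 else -1.0
        s_next = s + direction * gain * z
        return s_next, -abs(s_next)  # reward: negative distance to the origin
    return model


# Record a short trajectory with the "true" dynamics (gain = 1.0).
true_model = make_toy_model(1.0)
actions = [(0, 0.5), (1, 0.2), (0, 0.1)]
states, rewards, s = [], [], 0.0
for a in actions:
    s, r = true_model(s, a)
    states.append(s)
    rewards.append(r)

# A model identical to the true dynamics incurs no loss; a biased one does.
perfect = h_step_loss(true_model, 0.0, actions, states, rewards)
biased = h_step_loss(make_toy_model(1.2), 0.0, actions, states, rewards)
```

In a real training setup the loss would be differentiated with respect to the model's parameters; this sketch only evaluates it.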

Abstract

We propose a novel model-based reinforcement learning algorithm -- Dynamics Learning and predictive control with Parameterized Actions (DLPA) -- for Parameterized Action Markov Decision Processes (PAMDPs). The agent learns a parameterized-action-conditioned dynamics model and plans with a modified Model Predictive Path Integral control. Through the lens of Lipschitz continuity, we theoretically quantify the difference between the trajectory generated during planning and the optimal trajectory in terms of the value each achieves. Our empirical results on several standard benchmarks show that our algorithm achieves better sample efficiency and asymptotic performance than state-of-the-art PAMDP methods.
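
To make the idea of planning over parameterized actions more concrete, here is a simplified MPPI-style sketch. It is not the paper's planner. The sampling distribution (a categorical over action types plus one Gaussian per action type at each step), the update rule, the bound `Z_MAX`, the hyperparameters, and the function names `plan` and `toy_model` are all assumptions chosen for illustration, and `toy_model` stands in for a learned dynamics and reward model.

```python
import math
import random
from typing import Callable, Tuple

Action = Tuple[int, float]  # (discrete action index k, continuous parameter z)
Model = Callable[[float, Action], Tuple[float, float]]  # -> (next state, reward)

Z_MAX = 0.5  # assumed upper bound on the continuous parameter


def plan(model: Model, s0: float, n_types: int, rng: random.Random,
         horizon: int = 4, n_samples: int = 128, n_iters: int = 6,
         temperature: float = 10.0) -> Action:
    """Return the first action of the refined plan (hypothetical sketch)."""
    # Per-step sampling distribution: a categorical over action types and
    # one Gaussian over the parameter for each action type.
    probs = [[1.0 / n_types] * n_types for _ in range(horizon)]
    mean = [[Z_MAX / 2] * n_types for _ in range(horizon)]
    std = [[Z_MAX / 2] * n_types for _ in range(horizon)]

    for _ in range(n_iters):
        trajs, returns = [], []
        for _ in range(n_samples):
            s, ret, traj = s0, 0.0, []
            for t in range(horizon):
                k = rng.choices(range(n_types), weights=probs[t])[0]
                z = min(max(rng.gauss(mean[t][k], std[t][k]), 0.0), Z_MAX)
                s, r = model(s, (k, z))
                ret += r
                traj.append((k, z))
            trajs.append(traj)
            returns.append(ret)

        # MPPI-style weights: exponentiated returns, shifted for stability.
        best = max(returns)
        w = [math.exp(temperature * (g - best)) for g in returns]
        w_sum = sum(w)

        for t in range(horizon):
            for k in range(n_types):
                wk = [(wi, tr[t][1]) for wi, tr in zip(w, trajs)
                      if tr[t][0] == k]
                mass = sum(wi for wi, _ in wk)
                # A small floor keeps every action type sampleable.
                probs[t][k] = max(mass / w_sum, 1e-3)
                if mass > 0:
                    m = sum(wi * z for wi, z in wk) / mass
                    var = sum(wi * (z - m) ** 2 for wi, z in wk) / mass
                    mean[t][k], std[t][k] = m, max(math.sqrt(var), 0.01)

    k0 = max(range(n_types), key=lambda k: probs[0][k])
    return k0, mean[0][k0]


def toy_model(s: float, a: Action) -> Tuple[float, float]:
    """Stand-in for a learned model: type 0 moves right by z, type 1 left."""
    k, z = a
    s_next = s + z if k == 0 else s - z
    return s_next, -abs(s_next - 1.0)  # reward: closeness to the target 1.0


# Receding-horizon control: replan at every step, execute only the first action.
rng = random.Random(0)
state = 0.0
for _ in range(8):
    action = plan(toy_model, state, n_types=2, rng=rng)
    state, _ = toy_model(state, action)
```

Keeping a separate Gaussian per action type is one way to respect the structure of a parameterized action space, since the weighted update for a parameter only uses samples that chose the matching action type.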

Paper Structure (23 sections, 4 theorems, 39 equations, 11 figures, 7 tables)

Introduction
Background
Parameterized Action Markov Decision Processes
Model Predictive Control (MPC)
Related Work
Dynamics Learning and Predictive Control with Parameterized Actions
Dynamics Model with Parameterized Actions
MPC with Parameterized Actions
Analysis
Experiments
Results
Ablation Study
Visualization of Planning Iterations
Conclusion
Algorithm
...and 8 more sections

Key Result

Theorem 5.2

For a $(L_R^S, L_R^K, L_R^Z, L_T^S, L_T^K, L_T^Z)$-Lipschitz PAMDP and the learned DLPA $\epsilon_{T}$-accurate transition model $\hat{T}$ and $\epsilon_{R}$-accurate reward model $\hat{R}$, let $L_{\bar{T}}^S = \min \{L_T^S, L_{\hat{T}}^S\}$, $L_{\bar{T}}^K = \min \{L_T^K, L_{\hat{T}}^K\}$, ...

Figures (11)

Figure 1: Three distinct inference architectures for the predictors. All the models are parameterized with $\phi$.
Figure 2: Left: Inference of dynamics during training. Variables shown in black are fed as input to the dynamics model; variables shown in grey are generated by the dynamics model. Right: Planning and interacting with the environment. At each time step we execute only the first action from the sampled trajectory. White lines are example rollout trajectories from DLPA. The black line denotes the final selected rollout trajectory for one planning step.
Figure 3: Visualization of the tested environments.
Figure 4: Comparison of different algorithms across the 8 PAMDP benchmarks. Our algorithm DLPA significantly outperforms state-of-the-art PAMDP algorithms in terms of sample efficiency. Note that HyAR requires an additional 20,000 environment steps of pretraining for the action encoder, which we do not include in the plot.
Figure 5: Ablation study on (a) the planning algorithm, (b) H-step prediction loss, (c) PAMDP-specific MPPI, (d) different inference model architectures.
...and 6 more figures

Theorems & Definitions (8)

Definition 5.1
Theorem 5.2
Lemma 5.3
Theorem 2.1
proof
proof
Lemma 2.2
proof
