Learn A Flexible Exploration Model for Parameterized Action Markov Decision Processes
Zijian Wang, Bin Wang, Mingwen Shao, Hongbo Dou, Boxiang Tao
TL;DR
FLEXplore tackles reinforcement learning in Parameterized Action MDPs (PAMDPs) with a model-based approach: it learns a parameterized-action-conditioned dynamics model that is deliberately loose yet flexible, and plans with a modified Model Predictive Path Integral (MPPI) control. The method combines a Wasserstein-distance-guided loss to prevent the dynamics model from overfitting, reward smoothing to mitigate premature convergence, and a variational mutual-information objective to boost exploration during model rollouts. Theoretical analysis yields a regret bound for rollout trajectories under Lipschitz-PAMDP assumptions, linking performance to Wasserstein distances and Lipschitz constants. Empirically, FLEXplore achieves superior early learning efficiency and competitive or best asymptotic performance across six PAMDP benchmarks, outperforming several model-free and model-based baselines while exhibiting the intended dynamics flexibility and exploration behavior.
Abstract
Hybrid action models are widely considered an effective approach to reinforcement learning (RL) modeling. The current mainstream method is to train agents under Parameterized Action Markov Decision Processes (PAMDPs), which works well in specific environments. Unfortunately, existing methods either exhibit drastically low learning efficiency in complex PAMDPs or lose crucial information when converting between the raw space and the latent space. To improve both the learning efficiency and the asymptotic performance of the agent, we propose a model-based RL (MBRL) algorithm, FLEXplore. FLEXplore learns a parameterized-action-conditioned dynamics model and employs a modified Model Predictive Path Integral control. Unlike conventional MBRL algorithms, we carefully design the dynamics loss function and the reward smoothing process to learn a loose yet flexible model. Additionally, we use a variational lower bound to maximize the mutual information between the state and the hybrid action, which makes the agent's exploration more effective. We theoretically demonstrate that FLEXplore can reduce the regret of the rollout trajectory through the Wasserstein metric under given Lipschitz conditions. Our empirical results on several standard benchmarks show that FLEXplore achieves outstanding learning efficiency and asymptotic performance compared with the baselines.
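To make the planning step concrete, the following is a minimal sketch of a generic MPPI-style planner over hybrid actions (a discrete action k paired with a continuous parameter x). It is not FLEXplore's modified planner, and it does not include the learned dynamics model, the Wasserstein-guided loss, reward smoothing, or the mutual-information objective. The function name `mppi_plan`, the toy `toy_dynamics` and `toy_reward` functions, the parameter range [-1, 1], and the choice to share one parameter vector across all discrete actions are assumptions made purely for illustration.

```python
import numpy as np


def mppi_plan(state, dynamics, reward, n_discrete, param_dim,
              horizon=3, n_samples=256, n_iters=4, temperature=0.5, seed=0):
    """Generic MPPI-style planner for hybrid actions (illustrative sketch).

    Returns the first-step discrete action k and continuous parameter x.
    `dynamics(s, k, x)` and `reward(s, k, x)` stand in for a learned model.
    """
    rng = np.random.default_rng(seed)
    # Sampling distribution per time step: categorical over k, Gaussian over x.
    probs = np.full((horizon, n_discrete), 1.0 / n_discrete)
    mean = np.zeros((horizon, param_dim))
    std = np.ones((horizon, param_dim))

    for _ in range(n_iters):
        # Sample candidate hybrid-action sequences.
        ks = np.stack(
            [rng.choice(n_discrete, size=n_samples, p=probs[t])
             for t in range(horizon)], axis=1)                  # (N, H)
        noise = rng.standard_normal((n_samples, horizon, param_dim))
        xs = np.clip(mean + std * noise, -1.0, 1.0)             # (N, H, D)

        # Roll each candidate out through the model and sum the rewards.
        returns = np.zeros(n_samples)
        for i in range(n_samples):
            s = state
            for t in range(horizon):
                returns[i] += reward(s, ks[i, t], xs[i, t])
                s = dynamics(s, ks[i, t], xs[i, t])

        # Exponentially weight trajectories by return (softmax over samples).
        w = np.exp((returns - returns.max()) / temperature)
        w /= w.sum()

        # Refit the sampling distribution to the weighted samples.
        mean = np.einsum("n,nhd->hd", w, xs)
        var = np.einsum("n,nhd->hd", w, (xs - mean) ** 2)
        std = np.sqrt(var) + 1e-3
        for t in range(horizon):
            counts = np.bincount(ks[:, t], weights=w, minlength=n_discrete) + 1e-6
            probs[t] = counts / counts.sum()

    return int(np.argmax(probs[0])), mean[0]


# Hypothetical toy PAMDP: scalar state, target position 1.0.
# k = 0 moves the state by x[0]; k = 1 moves it by -1 and ignores x.
def toy_dynamics(s, k, x):
    return s + x[0] if k == 0 else s - 1.0


def toy_reward(s, k, x):
    return -(toy_dynamics(s, k, x) - 1.0) ** 2


k, x = mppi_plan(0.0, toy_dynamics, toy_reward, n_discrete=2, param_dim=1)
```

In this toy problem the planner should prefer k = 0 with a positive parameter, since that is the only way to move toward the target. In a PAMDP where each discrete action has its own parameter space, the Gaussian would instead be kept per discrete action.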
