Learn A Flexible Exploration Model for Parameterized Action Markov Decision Processes

Zijian Wang; Bin Wang; Mingwen Shao; Hongbo Dou; Boxiang Tao

TL;DR

FLEXplore tackles reinforcement learning with Parameterized Action MDPs by introducing a model-based approach that learns a parameterized-action-conditioned, loose yet flexible dynamics model and plans with a modified Model Predictive Control. The method combines a Wasserstein-distance–guided loss to prevent overfitting of dynamics, reward smoothing to mitigate premature convergence, and a variational mutual-information objective to boost exploration during model rollouts. Theoretical analysis yields a regret bound for rollout trajectories under Lipschitz-PAMDP assumptions, linking performance to Wasserstein distances and Lipschitz constants. Empirically, FLEXplore achieves superior early learning efficiency and competitive or best-in-class asymptotic performance across six PAMDP benchmarks, outperforming several model-free and model-based baselines and demonstrating desirable dynamics flexibility and exploration behavior.
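To make the role of the Wasserstein distance more concrete, the following is a minimal sketch, not the paper's loss function. It computes the empirical 1-Wasserstein distance between two equal-sized samples of a scalar state, which for one-dimensional data reduces to the mean absolute difference of the sorted samples. The function name `wasserstein_1d` and the toy "environment" and "model" samples are assumptions made for illustration only.

```python
import random


def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-sized 1-D samples.

    In one dimension the optimal transport plan matches sorted samples,
    so the distance is the mean absolute gap between order statistics.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("this sketch assumes two non-empty samples of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Hypothetical data: next-state samples from the environment, and a "learned"
# model whose predictions have the same shape but are shifted by 0.3.
rng = random.Random(0)
env_next_states = [rng.uniform(0.0, 1.0) for _ in range(1000)]
model_next_states = [s + 0.3 for s in env_next_states]

shift = wasserstein_1d(env_next_states, model_next_states)
same = wasserstein_1d(env_next_states, env_next_states)
```

For two distributions with the same shape that differ only by a translation, the distance equals the size of the shift, so the metric stays informative even when the two supports do not overlap.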

Abstract

Hybrid action models are widely considered an effective approach to reinforcement learning (RL) modeling. The current mainstream method is to train agents under Parameterized Action Markov Decision Processes (PAMDPs), which performs well in specific environments. Unfortunately, these models either exhibit drastically low learning efficiency in complex PAMDPs or lose crucial information in the conversion between raw space and latent space. To enhance the learning efficiency and asymptotic performance of the agent, we propose a model-based RL (MBRL) algorithm, FLEXplore. FLEXplore learns a parameterized-action-conditioned dynamics model and employs a modified Model Predictive Path Integral control. Unlike conventional MBRL algorithms, we carefully design the dynamics loss function and reward smoothing process to learn a loose yet flexible model. Additionally, we use the variational lower bound to maximize the mutual information between the state and the hybrid action, enhancing the exploration effectiveness of the agent. We theoretically demonstrate that FLEXplore can reduce the regret of the rollout trajectory through the Wasserstein Metric under given Lipschitz conditions. Our empirical results on several standard benchmarks show that FLEXplore has outstanding learning efficiency and asymptotic performance compared to other baselines.
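As a rough illustration of planning over hybrid actions $(k, z_k)$, here is a generic MPPI-style sketch. It is not the modified planner from the paper: the toy model `toy_model`, the goal position, the clipping range for $z$, and all hyperparameters are hypothetical, and the auxiliary mutual-information reward is omitted. Sampled action sequences are rolled out through the model, weighted by their exponentiated returns, and used to refit a categorical distribution over action types and a Gaussian mean per type.

```python
import math
import random


def toy_model(s, k, z):
    """Hypothetical stand-in for a learned model: returns (next_state, reward)."""
    s_next = s + (z if k == 0 else -z)   # k=0 moves right, k=1 moves left
    return s_next, -abs(s_next - 1.0)    # assumed goal at s = 1.0


def plan(s0, horizon=5, n_samples=256, n_iters=4,
         temperature=0.5, gamma=0.99, seed=0):
    rng = random.Random(seed)
    n_types, std, eps = 2, 0.3, 0.05
    probs = [[1.0 / n_types] * n_types for _ in range(horizon)]
    means = [[0.5] * n_types for _ in range(horizon)]
    best_seq, best_ret = None, -math.inf
    for _ in range(n_iters):
        seqs, rets = [], []
        for _ in range(n_samples):
            s, ret, seq = s0, 0.0, []
            for t in range(horizon):
                # Sample the discrete type, then its continuous parameter.
                k = rng.choices(range(n_types), weights=probs[t])[0]
                z = min(1.0, max(0.0, rng.gauss(means[t][k], std)))
                s, r = toy_model(s, k, z)
                ret += (gamma ** t) * r
                seq.append((k, z))
            seqs.append(seq)
            rets.append(ret)
            if ret > best_ret:
                best_seq, best_ret = seq, ret
        # Path-integral weights: softmax of returns (shifted for stability).
        top = max(rets)
        w = [math.exp((r - top) / temperature) for r in rets]
        total = sum(w)
        for t in range(horizon):
            for k in range(n_types):
                mass = sum(wi for wi, q in zip(w, seqs) if q[t][0] == k)
                if mass > 0.0:
                    means[t][k] = sum(
                        wi * q[t][1] for wi, q in zip(w, seqs) if q[t][0] == k
                    ) / mass
                # Keep a small floor so every action type stays reachable.
                probs[t][k] = eps + (1.0 - n_types * eps) * mass / total
    # Execute only the first hybrid action of the best sequence found.
    return best_seq[0], best_ret


first_action, best_return = plan(0.0)
```

Returning only the first action of the highest-return sequence mirrors the receding-horizon pattern described for the rollout phase: the agent acts once, observes the real feedback, and replans.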

Paper Structure (28 sections, 8 theorems, 48 equations, 7 figures, 3 tables, 1 algorithm)

Introduction
Background
Lipschitz-PAMDPs
Model Predictive Control (MPC)
Learn A Loose yet Flexible Model
The Advantage of Wasserstein Metric
Loss Function
Practical Implementation
Reward Smoothing in Stable Training Phase
Encourage Exploration by Maximizing Mutual Information
How FLEXplore Reduces the Regret of the Rollout Trajectory?
Experiments
Overall Performance
Loose yet Flexible Dynamics Learning
Dynamics Consistency Error
...and 13 more sections

Key Result

Theorem 3.1

Given an $H$-step sequence $\tau = \{ s_{t_0},k_{t_0},z_{k_{t_0}}, \dots, s_{t_0 + H},k_{t_0 + H},z_{k_{t_0 + H}} \}$ sampled from the environment, an upper bound of the loss function $\mathcal{L}^\text{ex}(\tau;f)$ is the $\gamma$-weighted sum of the Wasserstein Metric between the environment dynamics ...

Figures (7)

Figure 1: For simplicity, we designate the state as scalar, and the rectangular block represents a distribution. The orange rectangle on the left is given by environmental dynamics, while the green rectangle on the right is given by the learned dynamics, and both rectangles are congruent.
Figure 2: An intuitive example of reward smoothing for the Catch Point environment. In Catch Point, the agent can either move or execute a catch, with the latter having a specific range indicated by the light blue area. Upon successfully capturing the target, the agent receives the highest reward, which corresponds to the maximum value of the reward function. If the agent is merely close to capturing the target, it will not receive a high reward, as the conditions for successful capture have not been met. Given the continuity of the state space, the corresponding state will reside within a small neighborhood of the optimal state, as indicated by the yellow region, which is the region where we aim to apply reward smoothing.
Figure 3: The framework of FLEXplore. In the model learning phase (the left subplot), the agent samples a sequence $\tau = \{ s_0,k_0,z_{k_0},k_1,z_{k_1},\dots,k_H,z_{k_H} \}$ from the replay buffer and then feeds it to the model to generate predicted states $\hat{s}_{t+1}$ and rewards $\hat{r}_t$ for each timestep $t$. The model is trained with $\mathcal{L}^{\text{total}}_{\text{dyn}}$ and $\mathcal{L}^{\text{total}}_{\text{rew}}$, which correspond to flexible model learning in Section 3 and reward smoothing in Section 4, respectively. In the model rollout phase (the right subplot), the agent interacts with the learned model using a hybrid action sequence. In particular, we maximize mutual information as an auxiliary reward signal $r^{\text{aux}}$ to enhance agent exploration. The agent then interacts with the environment by selecting the first hybrid action $(k,z_k)$ from the hybrid action sequence with the highest trajectory return and stores the resulting feedback in the replay buffer.
Figure 4: Performance evaluation of FLEXplore and baselines over 4 seeds across six standard PAMDP benchmarks. To demonstrate the agent's learning efficiency, we plot the average return over the first 20,000 time steps. FLEXplore achieves the best performance in the first 20,000 time steps on all six benchmarks.
Figure 5: Investigation of loose yet flexible dynamics learning. (a) Dynamics consistency error of FLEXplore and DLPA, which shows that FLEXplore can learn looser dynamics. (b) Asymptotic performance of FLEXplore and DLPA under adversarial sample tests of different strengths, which shows that FLEXplore can learn more flexible dynamics.
...and 2 more figures

Theorems & Definitions (11)

Definition 2.1: Lipschitz Constant
Definition 2.2: Lipschitz-PAMDPs
Theorem 3.1: Looseness
Theorem 3.2: Flexibility
Corollary 3.1: State distribution
Corollary 3.2
Proposition 5.1: Variational lower bound
Theorem 6.1: the regret upper bound, proved by zhang2024model
Definition 1: Opposite Functions Set
Lemma 1
...and 1 more
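
The statement of Proposition 5.1 is not reproduced on this page. For orientation only, a standard variational lower bound on mutual information (the Barber-Agakov form) is sketched below; the paper's exact statement may differ. Here $a = (k, z_k)$ denotes the hybrid action, and $q(a \mid s)$ is an assumed variational distribution introduced for illustration.

$$
\begin{aligned}
I(S; A) &= H(A) - H(A \mid S) \\
        &= H(A) + \mathbb{E}_{p(s,a)}\big[\log p(a \mid s)\big] \\
        &\ge H(A) + \mathbb{E}_{p(s,a)}\big[\log q(a \mid s)\big].
\end{aligned}
$$

The gap between the last two lines is $\mathbb{E}_{p(s)}\big[\mathrm{KL}\big(p(\cdot \mid s)\,\|\,q(\cdot \mid s)\big)\big] \ge 0$, so the bound is tight exactly when $q$ matches the true conditional, and maximizing the right-hand side over $q$ tightens it.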
