Discretizing Continuous Action Space with Unimodal Probability Distributions for On-Policy Reinforcement Learning

Yuanyang Zhu; Zhi Wang; Yuanheng Zhu; Chunlin Chen; Dongbin Zhao

TL;DR

This work tackles the challenge of discretizing continuous action spaces in on-policy reinforcement learning by enforcing a unimodal, order-aware distribution over discretized actions. It introduces a Poisson-based ordinal architecture where each action dimension has a PMF parameterized by a nonnegative rate $\lambda_i$ learned from the state, with a right-truncated Softmax to maintain unimodality and reduce variance. A variance analysis suggests that the Poisson unimodal policy can yield lower gradient variance than traditional ordinal or Gibbs parameterizations, especially with moderate discretization $K$. Empirical results on MuJoCo locomotion tasks, particularly high-dimensional Humanoid environments, show faster convergence and higher performance than several baselines, highlighting practical impact for scalable, stable on-policy control.
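The Poisson parameterization summarized above can be illustrated with a minimal sketch for a single action dimension. It assumes the tempered form $p(j) \propto \exp(\log \mathrm{Poisson}(j;\lambda)/\tau)$ restricted to bins $0,\dots,K-1$; the function names, the temperature argument `tau`, and the sign convention inside the Softmax are illustrative assumptions, not taken from the paper.

```python
import math

def poisson_unimodal_pmf(lam, K, tau=1.0):
    """Tempered Poisson PMF truncated to bins 0..K-1 and renormalized.

    Hypothetical helper: lam is the positive rate, tau a positive temperature.
    """
    if lam <= 0 or tau <= 0:
        raise ValueError("lam and tau must be positive")
    # Log Poisson PMF for each bin j: j*log(lam) - lam - log(j!)
    log_pmf = [j * math.log(lam) - lam - math.lgamma(j + 1) for j in range(K)]
    scaled = [h / tau for h in log_pmf]
    m = max(scaled)  # subtract the max before exponentiating for stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def is_unimodal(probs, tol=1e-12):
    """True if probs rise weakly to one peak and then fall weakly."""
    peak = max(range(len(probs)), key=probs.__getitem__)
    rising = all(probs[i] <= probs[i + 1] + tol for i in range(peak))
    falling = all(probs[i] + tol >= probs[i + 1]
                  for i in range(peak, len(probs) - 1))
    return rising and falling

probs = poisson_unimodal_pmf(lam=4.5, K=11, tau=0.5)
print(is_unimodal(probs))  # True: mass decreases on both sides of the peak bin
```

Because the Poisson PMF is log-concave in $j$, dividing its logarithm by any positive temperature keeps a single peak; under this assumed form, a smaller `tau` only sharpens the distribution around that peak.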

Abstract

For on-policy reinforcement learning, discretizing the action space for continuous control can easily express multiple modes and is straightforward to optimize. However, when the inherent ordering between the discrete atomic actions is ignored, the explosion in the number of discrete actions can introduce undesirable properties and induce a higher variance in the policy gradient estimator. In this paper, we introduce a straightforward architecture that addresses this issue by constraining the discrete policy to be unimodal using Poisson probability distributions. Because its probability distribution is explicitly unimodal, this architecture better exploits the continuity of the underlying continuous action space. Extensive experiments show that the discrete policy with the unimodal probability distribution provides significantly faster convergence and higher performance for on-policy reinforcement learning algorithms in challenging control tasks, especially in highly complex tasks such as Humanoid. We also provide a theoretical analysis of the variance of the policy gradient estimator, which suggests that our carefully designed unimodal discrete policy retains a lower variance and yields a stable learning process.

Paper Structure (13 sections, 18 equations, 8 figures, 3 tables)

Introduction
Background and related work
Preliminaries
Related Work
Unimodal Probability Distributions for On-Policy Reinforcement Learning
Discretizing Action Space for Continuous Control
Unimodal ordinal architecture
Variance Analysis
Experiments
Comparison with Benchmark Baselines
Comparison with Discrete Policy
Stability and Hyperparameter Analysis
Conclusion and Limitations

Figures (8)

Figure 1: An example of continuous control with different action probability distributions over only one action dimension. Top: The unimodal continuous policy distribution is a Gaussian distribution whose mean $\mu$ and standard deviation $\sigma$ are estimated by a function approximator such as a deep neural network. Middle: When there are $K$ discrete actions, the policy distribution can be represented by a Gibbs distribution, wherein $K$ logits are generated by a function approximator via a sigmoid function. This results in a distribution that can be multimodal. Bottom: When there are $K$ discrete actions, the new unimodal ordinal policy distribution is characterized by the Poisson ordinal distribution, whose probability mass function is parameterized by a rate $\lambda$ that a function approximator outputs through a Softplus function. This distribution ensures that the two classes adjacent to the majority class receive the next greatest probability mass.
Figure 2: Normalization of the log-likelihood Poisson distributions. For each curve, we sample $21$ action distributions $j\in [0,20]$ and plot the normalized log-likelihood of the Poisson distribution for different values of the network output $f(s)$ and the temperature $\tau$ by evaluating $p(a_{ij}\mid s)$ with Eq. \ref{softmax}. The probability peaks at $f(s)$, and the probability mass gradually decreases on both sides of that class.
Figure 3: For simplicity, we illustrate the operation of the unimodal distribution over only one action dimension $i$. The first layer following $f_i(s)$ acts as a 'copy' layer, where $f_i(s)=f_i(s)_1=\dots=f_i(s)_K$. The second layer applies the log Poisson PMF transform, followed by the Softmax layer. The third layer normalizes the required probability distributions since the support of the Poisson is infinite. We then compute the final logits by ordinal parameterization as in Eq. \ref{logits}. Finally, we derive the final output probability via a Softmax operation, and actions are sampled according to this output distribution.
Figure 4: Performance as a function of the number of learning steps of PPO and TRPO on OpenAI gym MuJoCo locomotion tasks. Solid lines are average values over $6$ random seeds. Shaded regions correspond to one standard deviation. Each curve corresponds to a different policy architecture (Gaussian or unimodal policy with varying bins $K = 9, 11, 15$). Our unimodal policy significantly outperforms the Gaussian policy on most tasks.
Figure 5: Learning curves of PPO with unimodal, discrete, ordinal, and Gaussian policies on OpenAI gym MuJoCo locomotion tasks. Solid lines and shaded regions denote the average values and standard deviations over $6$ random seeds. All discrete policies have $K = 11$. The unimodal policy outperforms the other policies in both performance and stability on each task, especially on the Humanoid control tasks.
...and 3 more figures
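The layer order described in the Figure 3 caption (Softplus rate, copy layer, log Poisson PMF, Softmax, sampling) can be sketched end to end for several action dimensions. This is only a rough illustration under stated assumptions: the helper names, the small `eps` added to the rate, the temperature `tau`, and the evenly spaced mapping from bin index to the interval `[low, high]` are hypothetical choices, not details confirmed by the captions.

```python
import math
import random

def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps the Poisson rate positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dimension_probs(raw_output, K, tau=1.0, eps=1e-8):
    """Probabilities over K ordered bins for one action dimension."""
    lam = softplus(raw_output) + eps   # eps (assumed) avoids log(0)
    copies = [lam] * K                 # 'copy' layer: same value for all K bins
    # Log Poisson PMF transform applied per bin j.
    h = [j * math.log(copies[j]) - copies[j] - math.lgamma(j + 1)
         for j in range(K)]
    z = [v / tau for v in h]
    m = max(z)                         # stabilize the Softmax
    w = [math.exp(v - m) for v in z]
    s = sum(w)
    return [v / s for v in w]

def sample_action(raw_outputs, K, tau=1.0, low=-1.0, high=1.0, rng=random):
    """Sample one bin per dimension and map it to a continuous value."""
    action = []
    for raw in raw_outputs:
        probs = dimension_probs(raw, K, tau)
        j = rng.choices(range(K), weights=probs, k=1)[0]
        # Assumed mapping: K evenly spaced atomic actions in [low, high].
        action.append(low + (high - low) * j / (K - 1))
    return action

# Three hypothetical network outputs, one per action dimension.
print(sample_action([0.0, 2.0, -1.0], K=11, rng=random.Random(0)))
```

Each dimension is sampled independently here, so the policy needs one scalar output per dimension rather than $K$ free logits; whether the paper factorizes the policy in exactly this way is an assumption of the sketch.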
