Improving Stochastic Action-Constrained Reinforcement Learning via Truncated Distributions

Roland Stolz, Michael Eichelbeck, Matthias Althoff

TL;DR

This work tackles action-constrained reinforcement learning by directly truncating the policy distribution to feasible action sets and addressing the key computational challenges: evaluating log-probability, entropy, and gradients under truncation, plus efficient differentiable sampling. It delivers accurate numerical approximations for truncated Gaussian metrics across convex, non-convex, and disjoint sets, and introduces a hybrid sampling scheme that combines rejection sampling with geometric random walks, paired with a differentiable reparameterization for truncated sampling. The approach yields substantial performance gains on three benchmarks compared to methods that rely on non-truncated metrics, and highlights the importance of precise truncated-distribution calculations in safe RL. The findings lay groundwork for broader applications of truncated distributions in RL and related probabilistic learning settings, including physics-informed and curriculum-based methods.

Abstract

In reinforcement learning (RL), it is often advantageous to consider additional constraints on the action space to ensure safety or action relevance. Existing work on such action-constrained RL faces challenges regarding effective policy updates, computational efficiency, and predictable runtime. Recent work proposes to use truncated normal distributions for stochastic policy gradient methods. However, the computation of key characteristics, such as the entropy, log-probability, and their gradients, becomes intractable under complex constraints. Hence, prior work approximates these using the non-truncated distributions, which severely degrades performance. We argue that accurate estimation of these characteristics is crucial in the action-constrained RL setting, and propose efficient numerical approximations for them. We also provide an efficient sampling strategy for truncated policy distributions and validate our approach on three benchmark environments, which demonstrate significant performance improvements when using accurate estimations.
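To make the gap between truncated and non-truncated characteristics concrete, here is a minimal sketch for the simplest case: a one-dimensional Gaussian truncated to a single interval $[l, u]$, where closed forms exist. The function names are hypothetical and are not taken from the paper's code; the entropy expression is the standard closed form for a truncated normal.

```python
import math

def phi(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_log_prob(x, mu, sigma):
    """Log-density of the non-truncated Gaussian N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def normal_entropy(sigma):
    """Differential entropy of the non-truncated Gaussian."""
    return math.log(sigma * math.sqrt(2.0 * math.pi * math.e))

def truncated_log_prob(x, mu, sigma, l, u):
    """Log-density of N(mu, sigma^2) truncated to [l, u]."""
    if not (l <= x <= u):
        return -math.inf
    # Z is the probability mass that the non-truncated Gaussian puts on [l, u].
    Z = Phi((u - mu) / sigma) - Phi((l - mu) / sigma)
    return normal_log_prob(x, mu, sigma) - math.log(Z)

def truncated_entropy(mu, sigma, l, u):
    """Differential entropy of N(mu, sigma^2) truncated to [l, u]."""
    a, b = (l - mu) / sigma, (u - mu) / sigma
    Z = Phi(b) - Phi(a)
    return (math.log(sigma * math.sqrt(2.0 * math.pi * math.e) * Z)
            + (a * phi(a) - b * phi(b)) / (2.0 * Z))
```

Inside the interval, the truncated log-probability exceeds the non-truncated one by $-\log Z$, so the two agree only when almost all probability mass already lies in the feasible set. The smaller $Z$ becomes, the larger the error from substituting the non-truncated quantities.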

Paper Structure

This paper contains 36 sections, 3 theorems, 34 equations, 5 figures, 2 tables.

Key Result

Proposition 1

Given the distribution truncated to the $i$-th interval, $f(x; \mathcal{I}^{(i)})$, the PDF of the distribution $f(x)$ truncated to the union $\mathcal{U}_\mathcal{I}$ of $k$ non-overlapping intervals is

$$f(x; \mathcal{U}_\mathcal{I}) = \sum_{i=1}^{k} w_i \, f(x; \mathcal{I}^{(i)}),$$

where $Z_\mathcal{U}$ is the normalizing constant of $f(x; \mathcal{U}_\mathcal{I})$, and $w_i = \frac{Z_{\mathcal{I}^{(i)}}}{Z_\mathcal{U}}$ is the relative probability mass of the $i$-th interval.
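The proposition can be checked numerically. The sketch below assumes a one-dimensional Gaussian base density and the mixture form $\sum_i w_i \, f(x; \mathcal{I}^{(i)})$ with $w_i = Z_{\mathcal{I}^{(i)}} / Z_\mathcal{U}$; all function names are hypothetical and chosen for illustration only.

```python
import math

def norm_pdf(x, mu=0.0, sigma=1.0):
    """Density of the non-truncated Gaussian f(x)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the non-truncated Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_mass(interval, mu=0.0, sigma=1.0):
    """Normalizing constant Z of f truncated to a single interval (l, u)."""
    l, u = interval
    return norm_cdf(u, mu, sigma) - norm_cdf(l, mu, sigma)

def truncated_pdf(x, interval, mu=0.0, sigma=1.0):
    """Density f(x; I) of f truncated to one interval."""
    l, u = interval
    if not (l <= x <= u):
        return 0.0
    return norm_pdf(x, mu, sigma) / interval_mass(interval, mu, sigma)

def union_truncated_pdf(x, intervals, mu=0.0, sigma=1.0):
    """Density of f truncated to a union of non-overlapping intervals,
    written as a mixture of the per-interval truncated densities."""
    masses = [interval_mass(iv, mu, sigma) for iv in intervals]
    z_union = sum(masses)  # Z_U is the sum of the per-interval masses
    return sum((z_i / z_union) * truncated_pdf(x, iv, mu, sigma)
               for z_i, iv in zip(masses, intervals))
```

For a point inside one interval, only that interval's term is non-zero, and the weight cancels the per-interval normalizer, so the result equals $f(x) / Z_\mathcal{U}$. The same decomposition gives the log-probability on a disjoint feasible set from quantities that are already available per interval.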

Figures (5)

  • Figure 1: The PDF of the standard normal distribution $f(x)$ is truncated to the interval $[l, u]$ to obtain the truncated PDF $f(x; [l, u])$. The area under the curve is normalized to one.
  • Figure 2: The reparameterization trick: We sample the independent random variable $\tilde{\varepsilon}$ from $\mathcal{N}_{\tilde{\mathcal{A}^s}}(0,1)$ truncated to $\tilde{\mathcal{A}^s}$, and then apply the affine transformation $a^s = \mu + L \, \tilde{\varepsilon}$ to obtain samples from $\pi_\theta^s({\cdot} \vert s) = \mathcal{N}_{\mathcal{A}^s}(\mu, \Sigma)$, where $\Sigma = L L^T$.
  • Figure 3: Errors for the integral approximation using the outer, inner, and combined approximation of the polytope.
  • Figure 4: Comparison of the sampling times of RDHR, rejection sampling, and the combined sampling method with rejection limit $M=100$.
  • Figure 5: The mean returns and $95\%$ confidence intervals during RL training. App-Poly-Out, App-Poly-Inn, and App-Poly-Comb refer to the outer, inner, and combined approximations, while Og-Int and Og-Poly refer to the interval and polytope policies using the original (non-truncated) metrics of $\pi_\theta({\cdot} \vert s)$. Exact-Int is the interval policy that computes exact analytic metrics.
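The sampling ideas behind Figures 2 and 4 can be sketched together. The code below is an illustration under several assumptions, not the paper's implementation: the feasible set is a polytope $\{a : C a \le d\}$, $L$ is a lower-triangular Cholesky factor, a feasible action `a_start` is known, and the fallback is a generic hit-and-run walk with a fixed number of steps. All function and parameter names are hypothetical. Only the forward sampling path is shown; gradients through the truncation are not handled.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal, used for sampling along a line

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def solve_lower(L, b):
    """Forward substitution: solve L x = b for lower-triangular L."""
    x = []
    for i, row in enumerate(L):
        x.append((b[i] - sum(row[j] * x[j] for j in range(i))) / row[i])
    return x

def feasible(G, h, eps, tol=1e-12):
    return all(g <= hi + tol for g, hi in zip(matvec(G, eps), h))

def hit_and_run_step(G, h, eps, rng):
    """One hit-and-run step targeting N(0, I) restricted to {G eps <= h}."""
    u = [rng.gauss(0.0, 1.0) for _ in eps]
    norm = math.sqrt(sum(c * c for c in u))
    u = [c / norm for c in u]  # random unit direction
    # Chord of the polytope along the line eps + t * u.
    t_lo, t_hi = -math.inf, math.inf
    for g_eps, g_u, hi in zip(matvec(G, eps), matvec(G, u), h):
        if abs(g_u) < 1e-15:
            continue
        t = (hi - g_eps) / g_u
        if g_u > 0:
            t_hi = min(t_hi, t)
        else:
            t_lo = max(t_lo, t)
    # Along the line, the standard normal density is N(-eps.u, 1) in t,
    # so t is drawn from that 1-D Gaussian truncated to [t_lo, t_hi].
    m = -sum(a * b for a, b in zip(eps, u))
    p_lo, p_hi = STD.cdf(t_lo - m), STD.cdf(t_hi - m)
    p = p_lo + rng.random() * (p_hi - p_lo)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # inv_cdf needs 0 < p < 1
    t = min(max(m + STD.inv_cdf(p), t_lo), t_hi)
    return [a + t * b for a, b in zip(eps, u)]

def sample_action(mu, L, C, d, a_start, rng, M=100, walk_steps=50):
    """Sample a = mu + L eps with eps ~ N(0, I) truncated so that C a <= d."""
    n = len(mu)
    # Transformed set: C (mu + L eps) <= d  <=>  (C L) eps <= d - C mu.
    G = [[sum(row[k] * L[k][j] for k in range(n)) for j in range(n)] for row in C]
    h = [di - ci for di, ci in zip(d, matvec(C, mu))]
    for _ in range(M):  # rejection sampling, at most M proposals
        eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if feasible(G, h, eps):
            break
    else:  # no proposal accepted: fall back to the random walk
        eps = solve_lower(L, [a - m for a, m in zip(a_start, mu)])
        for _ in range(walk_steps):
            eps = hit_and_run_step(G, h, eps, rng)
    return [m + l for m, l in zip(mu, matvec(L, eps))]
```

The rejection limit bounds the worst-case cost when the feasible set carries little probability mass, while the walk guarantees a feasible sample at a fixed cost. The walk's samples are only approximately distributed according to the truncated Gaussian, and the quality depends on the number of steps, which is an arbitrary choice here.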

Theorems & Definitions (6)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof