q-exponential family for policy optimization

Lingwei Zhu, Haseeb Shah, Han Wang, Yukie Nagai, Martha White

TL;DR

This work introduces the $q$-exponential family as a flexible and tractable policy class for continuous-action reinforcement learning, enabling heavy-tailed ($q>1$) and light-tailed ($q<1$) policies, with $q=1$ recovering the standard exponential family. It systematically embeds $q$-exponential policies, including $q$-Gaussian and Student's t variants, into online and offline actor-critic algorithms and analyzes practical concerns such as entropy approximations and out-of-support actions. Across online Classic Control and offline D4RL MuJoCo benchmarks, heavy-tailed policies generally improve performance over the Gaussian baseline, with the Student's t policy offering strong stability and the heavy-tailed $q$-Gaussian benefiting Tsallis-based regularization in offline settings. The results demonstrate tail flexibility as a practical lever for exploration robustness and offline data leverage, and the authors provide code to facilitate adoption of these policies in RL research and applications.

Abstract

Policy optimization methods benefit from a simple and tractable policy parametrization, usually the Gaussian for continuous action spaces. In this paper, we consider a broader policy family that remains tractable: the $q$-exponential family. This family of policies is flexible, allowing the specification of both heavy-tailed policies ($q>1$) and light-tailed policies ($q<1$). This paper examines the interplay between $q$-exponential policies and several actor-critic algorithms on both online and offline problems. We find that heavy-tailed policies are more effective in general and can consistently improve on the Gaussian. In particular, we find the Student's t-distribution to be more stable than the Gaussian across settings and that a heavy-tailed $q$-Gaussian for Tsallis Advantage Weighted Actor-Critic consistently performs well in offline benchmark problems. Our code is available at https://github.com/lingweizhu/qexp.

Paper Structure (28 sections, 26 equations, 22 figures, 8 tables, 2 algorithms)


Figures (22)

  • Figure 1: The policy parametrizations considered in this paper.
  • Figure 2: Performance relative to the Squashed Gaussian on the offline D4RL MuJoCo task, averaged across the selected algorithms and environments.
  • Figure 3: $\exp_q x$ and $\ln_q x$ for $q<1$ and $q>1$. When $q=1$ they recover the standard exponential and logarithm, respectively. For $q<1$ the $q$-exp can return zero values and hence $q$-exp policies may achieve sparsity. For $q>1$, the $q$-exp decays more slowly towards 0, resulting in heavy-tailed behavior. The rightmost panel shows the $q$-Gaussian for different values of $q$.
  • Figure 4: Learning curves on the classic control environments. Only the Gaussian and the best policy parametrization for each setting are shown with full opacity. The best policy is picked based on the total area under the curve (AUC). TAWAC(0) refers to TAWAC with entropic index $q'=0$ in Eq. (\ref{eq:main_tawac_loss}). Despite tuning hyperparameters separately for each policy, the Gaussian is the best policy in only $1/12$ settings. In most other settings, the Gaussian policy performs significantly worse than the best.
  • Figure 5: (Left) The percentage of algorithm-environment combinations in which each policy parametrization is better than the Gaussian, based on total AUC. A bar above the $50\%$ line means that the parametrization beats the Gaussian in more than half of the settings. Student's t and Light-tailed Gaussians are better than the Gaussian in $75\%$ and $66\%$ of the settings, respectively. (Right) The number of algorithm-environment combinations in which each policy parametrization performed best, based on AUC. The Student's t policy performed best in $5/12$ settings, whereas the Gaussian policy performed best only once.
  • ...and 17 more figures
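
The behavior described in the Figure 3 caption can be checked numerically. The sketch below assumes the standard Tsallis definitions, $\exp_q x = [1 + (1-q)x]_+^{1/(1-q)}$ and $\ln_q x = (x^{1-q} - 1)/(1-q)$, which may differ in convention from the paper's own equations. The function names, the unit-scale unnormalized $q$-Gaussian shape, and the choice to return infinity outside the domain for $q>1$ are illustrative assumptions, not taken from the authors' code.

```python
import math


def exp_q(x: float, q: float) -> float:
    """q-exponential under the assumed Tsallis definition.

    exp_q(x) = [1 + (1 - q) x]_+ ** (1 / (1 - q)), with exp(x) at q = 1.
    """
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    if base <= 0.0:
        # q < 1: the positive part truncates to zero (source of sparsity).
        # q > 1: x lies at or beyond the divergence point 1 / (q - 1);
        # returning inf here is an illustrative choice.
        return 0.0 if q < 1.0 else math.inf
    return base ** (1.0 / (1.0 - q))


def ln_q(x: float, q: float) -> float:
    """q-logarithm, the inverse of exp_q where exp_q is positive and finite."""
    if x <= 0.0:
        raise ValueError("ln_q is only defined for x > 0")
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def q_gaussian_shape(x: float, q: float) -> float:
    """Unnormalized, unit-scale q-Gaussian shape exp_q(-x^2 / 2).

    Hypothetical helper: the normalizer and scale convention are omitted.
    """
    return exp_q(-0.5 * x * x, q)


if __name__ == "__main__":
    for q in (0.5, 1.0, 2.0):
        print(f"q={q}: exp_q(-10)={exp_q(-10.0, q):.6g}, "
              f"shape(3)={q_gaussian_shape(3.0, q):.6g}")
```

At $x=-10$ the light-tailed case ($q=0.5$) is exactly zero, the standard exponential is about $4.5 \times 10^{-5}$, and the heavy-tailed case ($q=2$) is $1/11$, which matches the ordering the caption describes.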