Generalized Gaussian Temporal Difference Error for Uncertainty-aware Reinforcement Learning

Seyeon Kim; Joonhun Lee; Namhoon Cho; Sungjun Han; Wooseop Hwang

TL;DR

Conventional TD learning often assumes zero-mean Gaussian TD errors, which misrepresents tail uncertainty. The paper introduces a generalized Gaussian error framework whose shape parameter $\beta$ captures tail behavior. It derives a closed-form expression for aleatoric uncertainty and a risk-averse weighting scheme to mitigate epistemic uncertainty, complemented by batch inverse error variance (BIEV) regularization. The method is instantiated with a beta head that estimates $\beta$ (with the scale fixed at $\alpha=1$) and is evaluated with SAC and PPO on MuJoCo and discrete-control tasks, where it improves sample efficiency and robustness. The work advances uncertainty-aware RL by accommodating non-Gaussian TD errors and by providing practical mechanisms for balancing aleatoric and epistemic uncertainty during learning.
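To make the role of $\beta$ concrete, the following is a minimal sketch of a per-sample negative log-likelihood under a zero-mean generalized Gaussian density, $f(x) = \frac{\beta}{2\alpha\Gamma(1/\beta)} \exp\left(-(|x|/\alpha)^{\beta}\right)$. The function name `ggd_nll` and the scalar interface are assumptions made for illustration; the paper's actual training objective is not reproduced in this summary and may differ.

```python
import math


def ggd_nll(td_error: float, beta: float, alpha: float = 1.0) -> float:
    """Negative log-likelihood of one TD error under a zero-mean
    generalized Gaussian with shape `beta` and scale `alpha`.

    Hypothetical helper for illustration; alpha defaults to 1 to mirror
    the fixed scale mentioned in the TL;DR.
    """
    if beta <= 0 or alpha <= 0:
        raise ValueError("beta and alpha must be positive")
    z = abs(td_error) / alpha
    # -log f(x) = z^beta - log(beta) + log(2 * alpha) + log Gamma(1 / beta)
    return z ** beta - math.log(beta) + math.log(2.0 * alpha) + math.lgamma(1.0 / beta)


if __name__ == "__main__":
    # A large TD error is penalized far less under a heavy-tailed fit
    # (beta = 1, Laplace) than under the Gaussian-shaped case (beta = 2).
    for beta in (1.0, 2.0):
        print(beta, ggd_nll(3.0, beta))
```

In this sketch, a smaller $\beta$ yields heavier tails, so outlying TD errors contribute less to the loss than they would under a Gaussian-shaped fit.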

Abstract

Conventional uncertainty-aware temporal difference (TD) learning often assumes a zero-mean Gaussian distribution for TD errors, leading to inaccurate error representations and compromised uncertainty estimation. We introduce a novel framework for generalized Gaussian error modeling in deep reinforcement learning that makes error distribution modeling more flexible by incorporating an additional higher-order moment, namely kurtosis, thereby improving the estimation and mitigation of data-dependent aleatoric uncertainty. We examine the influence of the shape parameter of the generalized Gaussian distribution (GGD) on aleatoric uncertainty and provide a closed-form expression that demonstrates an inverse relationship between uncertainty and the shape parameter. Additionally, we propose a theoretically grounded weighting scheme that addresses epistemic uncertainty by fully leveraging the GGD. We refine batch inverse variance weighting with bias reduction and kurtosis considerations, enhancing robustness. Experiments with policy gradient algorithms demonstrate significant performance gains.
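The inverse relationship between uncertainty and the shape parameter can be illustrated with the standard GGD moments: the variance is $\alpha^2\,\Gamma(3/\beta)/\Gamma(1/\beta)$ and the kurtosis is $\Gamma(5/\beta)\,\Gamma(1/\beta)/\Gamma(3/\beta)^2$. The sketch below evaluates both; the function names are hypothetical, and these textbook formulas are not necessarily the exact closed-form expression derived in the paper.

```python
import math


def ggd_variance(beta: float, alpha: float = 1.0) -> float:
    """Variance of a generalized Gaussian with shape `beta`, scale `alpha`."""
    return alpha ** 2 * math.gamma(3.0 / beta) / math.gamma(1.0 / beta)


def ggd_kurtosis(beta: float) -> float:
    """Kurtosis (non-excess) of a generalized Gaussian; independent of scale."""
    g1 = math.gamma(1.0 / beta)
    g3 = math.gamma(3.0 / beta)
    g5 = math.gamma(5.0 / beta)
    return g5 * g1 / (g3 ** 2)


if __name__ == "__main__":
    # With alpha fixed at 1, both variance and kurtosis shrink as beta grows.
    for beta in (1.0, 2.0, 4.0):
        print(f"beta={beta}: var={ggd_variance(beta):.4f}, "
              f"kurtosis={ggd_kurtosis(beta):.4f}")
```

Setting $\beta=2$ recovers the Gaussian shape (kurtosis 3, variance $\alpha^2/2$), and $\beta=1$ recovers the Laplace distribution (kurtosis 6, variance $2\alpha^2$), so with the scale held fixed a smaller shape parameter corresponds to both larger variance and heavier tails.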

Paper Structure (34 sections, 4 theorems, 38 equations, 19 figures)


Introduction
Contributions
Background
Uncertainty
Tailedness
Gumbel error modeling
Methods
Generalized Gaussian error modeling
Empirical evidence
Theoretical analysis
Batch inverse error variance regularization
Experiments
Discussion
Further investigation
Relevant applications
...and 19 more sections

Key Result

Proposition 1

Let $X_1, X_2, \ldots, X_n$ be a sequence of independent, non-normally distributed random variables from a population $X$ with mean $\mu$, variance $\sigma^2$, and kurtosis $\kappa$. Then, with the MLE estimators under normality assumption, i.e., $\hat{\mu}=\sum_{i=1}^n X_i/n$ for mean and $\hat{\si

Figures (19)

Figure 1: Generalized Gaussian distribution.
Figure 2: TD error plots of SAC at the initial and final evaluations, arranged left to right, with probability density functions (PDFs) fitted using SciPy [virtanen2020scipy]. Additional plots for other environments and for PPO are available in the appendix. The plots show aggregated data to emphasize general trends rather than individual samples.
Figure 3: Sample efficiency curves of SAC on MuJoCo environments, illustrating median return values averaged over ten random seeds. Shaded regions indicate the standard deviation. Prefixes denote applied techniques, e.g., 'GD-' for variance head, 'GGD-' for beta head, and 'IEV-' for BIEV regularization.
Figure 4: Coefficients of variation of parameter estimates for SAC variants. Results for other environments can be found in the appendix.
Figure 5: Sample efficiency curves of PPO on various control environments.
...and 14 more figures

Theorems & Definitions (11)

Proposition 1: Biased variance estimator [yuan2005effect]
Remark 1: Varietal variance estimator [burch2014estimating]
Remark 2
Theorem 1: Positive-definiteness [bochner1937stable; ushakov2011selected; dytso2018analytical]
Theorem 2: Second-order stochastic dominance [dytso2018analytical]
Remark 3
Proposition 2: MBBE of variance [searls1990note; wencheko2009estimation]
proof
proof
proof
...and 1 more
