Pseudo-Quantized Actor-Critic Algorithm for Robustness to Noisy Temporal Difference Error

Taisuke Kobayashi

Abstract

In reinforcement learning (RL), temporal difference (TD) errors are widely adopted for optimizing value and policy functions. However, since the TD error is defined by a bootstrap method, its computation tends to be noisy, which destabilizes learning. Heuristics to improve the accuracy of TD errors, such as target networks and ensemble models, have been introduced to date. While these are essential components of current deep RL algorithms, they cause side effects such as increased computational cost and reduced learning efficiency. Therefore, this paper revisits the TD learning algorithm based on control as inference and derives a novel algorithm capable of robust learning against noisy TD errors. First, the distribution model of optimality, a binary random variable, is represented by a sigmoid function. Together with the forward and reverse Kullback-Leibler divergences, this new model yields a robust learning rule: when the sigmoid function saturates under a large TD error that is probably due to noise, the gradient vanishes, implicitly excluding that sample from learning. Furthermore, the two divergences exhibit distinct gradient-vanishing characteristics. Building on these analyses, the optimality is decomposed into multiple levels to achieve a pseudo-quantization of TD errors, aiming for further noise reduction. Additionally, a Jensen-Shannon divergence-based approach is approximately derived to inherit the characteristics of both divergences. These benefits are verified through RL benchmarks, demonstrating stable learning even when heuristics are insufficient or rewards contain noise.
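
As a rough intuition for the gradient-vanishing mechanism described above, the sketch below models the optimality as a Bernoulli variable whose probability is a sigmoid of the scaled value estimate, and differentiates the forward and reverse KL losses by hand. The names (`v`, `target`, `lmbda`) and this particular parameterization are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact model):
# optimality O is Bernoulli with probability sigmoid(value / lmbda); the critic
# loss is a KL divergence between target and estimated optimality distributions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kl_gradients(v, target, lmbda=1.0):
    """Hand-derived gradients w.r.t. the value estimate v of the Bernoulli KLs."""
    s_v, s_t = sigmoid(v / lmbda), sigmoid(target / lmbda)
    delta = target - v                                  # ordinary (linear) TD error
    grad_fwd = (s_v - s_t) / lmbda                      # forward KL(target || estimate)
    grad_rev = -s_v * (1.0 - s_v) * delta / lmbda**2    # reverse KL: "weight" x "error"
    return delta, grad_fwd, grad_rev

# Once sigmoid(v) saturates (v = 6 here), even a huge, possibly noisy target
# yields only a small gradient: the forward-KL gradient is bounded by 1/lmbda,
# and the reverse-KL gradient is scaled down by the weight s_v * (1 - s_v).
# A plain squared TD error would instead follow -delta and blow up.
for target in (7.0, 50.0):
    delta, g_f, g_r = kl_gradients(v=6.0, target=target)
    print(f"delta={delta:5.1f}  squared-TD grad={-delta:6.1f}  "
          f"fwd-KL grad={g_f:+.4f}  rev-KL grad={g_r:+.4f}")
```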

Paper Structure

This paper contains 28 sections, 38 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Replacement of the model for optimality probability: (a) the model with an exponential function is usually employed due to its simplicity; (b) to always satisfy the definition of a probability, a new model with a sigmoid function is introduced in this work.
  • Figure 2: Analysis of nonlinear TD errors derived with the forward and reverse KL divergences: in both cases, the gradients vanish when the value estimate is close to the upper or lower bound; with the reverse KL divergence (left), the contour lines are mirrored about the dashed line ($\delta=0$); with the forward KL divergence (right), the contour lines are symmetric about the dashed line.
  • Figure 3: Decomposition of nonlinear TD errors into weight and error terms: the weight term (left) saturates (or vanishes) only in the reverse KL divergence case; when $\sigma_V \simeq \{0, 1\}$, the error term in the forward KL divergence case outputs almost exclusively positive or negative values.
  • Figure 4: Extension of the model for optimality probability: since the sigmoid function always keeps its output within $[0, 1]$, multiple optimality variables can easily be introduced; by spacing them evenly, the saturation property is expected between them as a kind of (pseudo-)quantization (a toy sketch follows this figure list).
  • Figure 5: Analysis of nonlinear TD errors with the extended optimality probability: if $\lambda$ is too small (left), the saturation property between adjacent optimalities is canceled out; by setting $\lambda$ large enough (right), the expected pseudo-quantization, with vanishing gradients between adjacent optimalities, is observed.
  • ...and 7 more figures
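
To make the pseudo-quantization idea of Figures 4 and 5 concrete, the toy sketch below sums several evenly spaced sigmoids and compares the slope of the resulting error signal on a level versus midway between levels. Interpreting $\lambda$ as the spacing between adjacent optimality levels (with unit-width sigmoids) is an assumption made purely for illustration and may differ from the paper's definition.

```python
# Toy sketch of pseudo-quantization (illustrative assumptions, not the paper's
# derivation): the error signal is a sum of n_levels sigmoids whose centers are
# evenly spaced by lam, so its slope is the sum of shifted sigmoid derivatives.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def staircase_slope(delta, lam, n_levels=5):
    """Slope of sum_k sigmoid(delta - c_k) with centers c_k spaced by lam."""
    centers = lam * (np.arange(n_levels) - (n_levels - 1) / 2.0)
    s = sigmoid(delta - centers)        # broadcast the scalar delta over centers
    return (s * (1.0 - s)).sum()        # derivative of each sigmoid, summed

for lam in (1.0, 12.0):
    at_level = staircase_slope(0.0, lam)        # right on an optimality level
    between = staircase_slope(lam / 2.0, lam)   # midway between adjacent levels
    print(f"lam={lam:4.1f}  slope on level={at_level:.3f}  between levels={between:.3f}")
# lam=1 : both slopes are ~0.85 -- the levels overlap and no quantization appears.
# lam=12: the slope between levels (~0.005) nearly vanishes compared with the
# slope on a level (~0.25), so noisy TD-error perturbations in that gap barely
# move the learning signal -- a pseudo-quantized error.
```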