
DROP: Distributional and Regular Optimism and Pessimism for Reinforcement Learning

Taisuke Kobayashi

TL;DR

This paper introduces a novel theoretically-grounded model with optimism and pessimism, which is derived from control as inference, and suggests that DROP is a new model that can elicit the potential contributions of optimism and pessimism.

Abstract

In reinforcement learning (RL), the temporal difference (TD) error is known to be related to the firing rate of dopamine neurons. It has been observed that dopamine neurons do not behave uniformly; rather, each responds to the TD error in an optimistic or pessimistic manner, which has been interpreted as a kind of distributional RL. To explain such biological data, a heuristic model has been designed with learning rates that are asymmetric for positive and negative TD errors. However, this heuristic model is not theoretically grounded, and it is unknown whether it can work as an RL algorithm. This paper therefore introduces a novel theoretically grounded model with optimism and pessimism, which is derived from control as inference. In combination with ensemble learning, a distributional value function as a critic is estimated from regularly introduced optimism and pessimism. Based on its central value, a policy in an actor is improved. The proposed algorithm, called DROP (distributional and regular optimism and pessimism), is compared with the heuristic model on dynamic tasks. Although the heuristic model showed poor learning performance, DROP showed excellent performance in all tasks, indicating high generality. In other words, the results suggest that DROP is a new model that can elicit the potential contributions of optimism and pessimism.
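
To make the heuristic model mentioned in the abstract concrete, the following is a minimal sketch of an update with learning rates that differ by the sign of the TD error. It is not the paper's code and it is not DROP: the function name `asymmetric_td_update`, the single-state setting without bootstrapping, and all learning-rate values are assumptions chosen only for illustration.

```python
import numpy as np


def asymmetric_td_update(values, reward, lr_pos, lr_neg):
    """One update of an ensemble of value estimates for a single state.

    Hypothetical illustration: each unit i keeps its own estimate values[i]
    and uses lr_pos[i] when its TD error is positive and lr_neg[i] otherwise.
    The state is treated as terminal, so there is no bootstrap term.
    """
    delta = reward - values                      # per-unit TD error
    lr = np.where(delta > 0.0, lr_pos, lr_neg)   # sign-dependent learning rate
    return values + lr * delta


# Assumed learning rates: unit 0 is pessimistic, unit 1 neutral, unit 2 optimistic.
lr_pos = np.array([0.002, 0.005, 0.008])
lr_neg = np.array([0.008, 0.005, 0.002])

rng = np.random.default_rng(0)
values = np.zeros(3)
for _ in range(20000):
    reward = rng.normal(loc=1.0, scale=1.0)      # noisy reward with mean 1
    values = asymmetric_td_update(values, reward, lr_pos, lr_neg)

# The pessimistic unit settles below the mean reward, the optimistic one above.
print(values)
```

In this toy setting the three estimates spread out around the mean reward, which is the sense in which such an ensemble can be read as a distributional representation.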

Paper Structure

This paper contains 19 sections, 16 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Heuristic model designed in the previous studies [dabney2020distributional, muller2024distributional]: because the model is an ensemble in which each dopamine neuron estimates its own value function, the origin at which the TD error becomes zero differs for each neuron; each neuron has an asymmetric learning rate depending on the sign of the TD error, so that some neurons learn optimistically, preferring a better outcome than predicted, while others learn pessimistically, preferring a worse outcome than predicted.
  • Figure 2: Inversion of the definition of optimality: in the previous studies [levine2018reinforcement, kobayashi2022optimistic], the larger the value function, the more optimal (i.e. $O=1$) the outcome is, as shown on the left side; the inverted definition is similar but distinct in that the smaller the value function, the more non-optimal (i.e. $O=0$) the outcome is, as shown on the right side.
  • Figure 3: Optimistic and pessimistic TD errors parameterized by $\beta \in \mathbb{R}$: when $\beta > 0$, the update scale is positively biased with optimism from the original TD error; symmetrically, pessimism can be obtained with $\beta < 0$.
  • Figure 4: Distributional value function modeled by an ensemble model of multiple value functions with different optimism/pessimism: the network parameters before the output layer are all shared between $N$ value functions; with randomly fixed weights for diversity, the output layer separately estimates the respective value functions, which are trained with the corresponding $\beta_1, \ldots, \beta_N$; as value functions trained with biased TD errors are biased as well, their ensemble can represent the distribution of value functions.
  • Figure 5: Policy improvement with ensemble of optimistic/pessimistic value functions: since learning multiple policies is costly, a single policy is optimized based on a central value of nonlinear TD errors calculated from multiple value functions; by employing the median as this central value, the direction of the policy improvement can be properly determined without being affected by outliers.
  • ...and 6 more figures
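
The role of the median in the caption of Figure 5 can be illustrated with a small sketch. This is not the paper's implementation: the nonlinear TD errors of DROP are not reproduced here, so plain made-up numbers stand in for the per-member TD errors, and the function name `central_td_error` is hypothetical.

```python
import numpy as np


def central_td_error(td_errors):
    """Median over the ensemble axis (last axis) as the central value.

    td_errors has shape (batch, N), one column per ensemble member.
    """
    return np.median(td_errors, axis=-1)


# Made-up TD errors for two transitions and N = 5 ensemble members.
# The last member is an outlier in both rows.
td_errors = np.array([
    [0.20, 0.10, 0.30, 0.25, -5.0],
    [-0.40, -0.30, -0.50, -0.35, 8.0],
])

median_signal = central_td_error(td_errors)
mean_signal = td_errors.mean(axis=-1)

# The median keeps the sign that most members agree on;
# the mean is dragged to the opposite sign by the single outlier.
print("median:", median_signal)
print("mean:  ", mean_signal)
```

In this example a single outlying member flips the sign of the mean in both rows, while the median stays with the majority, which matches the caption's point that the direction of policy improvement should not be determined by outliers.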