Reward-Punishment Reinforcement Learning with Maximum Entropy

Jiexin Wang; Eiji Uchibe

TL;DR

The paper integrates long-term policy entropy into reward-punishment reinforcement learning to improve robustness and sample efficiency. It introduces softDMP, a maximum-entropy generalization of Deep MaxPain in which the entropy coefficients $\eta_+$ and $\eta_-$ (for the reward and punishment modules) can each take any real value $\eta$, so the update operator interpolates among max, mellow-max, mean, and min. It also uses a flipped pain-avoidance policy and a GAN-inspired discriminator that routes experiences to separate replay buffers for learning the reward and punishment modules. The authors provide theoretical grounding and empirical validation in discrete Grid-world tasks and ROS Gazebo-based Turtlebot 3 navigation. The results show that the min operator paired with the flipped policy propagates negative signals better, and that buffer separation improves data efficiency and robustness. The framework offers a flexible, operator-smoothed approach to reward-punishment RL, evaluated on simulated robotic navigation.
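The interpolation among max, mellow-max, mean, and min can be illustrated with a minimal sketch. It assumes a mellowmax-style parameterization, $\frac{1}{\eta}\log\big(\frac{1}{n}\sum_a e^{\eta Q(a)}\big)$, in which $\eta$ acts as an inverse temperature; the paper's exact definition of $\eta$ may differ, and the function name `soft_operator` is hypothetical.

```python
import math


def soft_operator(q_values, eta):
    """Mellowmax-style operator: (1/eta) * log(mean(exp(eta * q))).

    Assumed convention (not confirmed by the source): eta -> +inf gives max,
    eta -> -inf gives min, and eta = 0 gives the arithmetic mean.
    """
    n = len(q_values)
    if eta == 0:
        return sum(q_values) / n
    if math.isinf(eta):
        return max(q_values) if eta > 0 else min(q_values)
    # Shift by the dominant value so every exponent is <= 0 (no overflow).
    shift = max(q_values) if eta > 0 else min(q_values)
    total = sum(math.exp(eta * (q - shift)) for q in q_values)
    return shift + math.log(total / n) / eta


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]
    for eta in (-math.inf, -1000, -1, 0, 1, 1000, math.inf):
        print(eta, soft_operator(q, eta))
```

Under this convention the output moves monotonically from the minimum to the maximum of the action values as $\eta$ increases, passing through the mean at $\eta = 0$.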

Abstract

We introduce the "soft Deep MaxPain" (softDMP) algorithm, which integrates the optimization of long-term policy entropy into reward-punishment reinforcement learning objectives. Our motivation is to facilitate a smoother variation of operators utilized in the updating of action values beyond traditional "max" and "min" operators, where the goal is enhancing sample efficiency and robustness. We also address two unresolved issues from the previous Deep MaxPain method. Firstly, we investigate how the negated ("flipped") pain-seeking sub-policy, derived from the punishment action value, collaborates with the "min" operator to effectively learn the punishment module and how softDMP's smooth learning operator provides insights into the "flipping" trick. Secondly, we tackle the challenge of data collection for learning the punishment module to mitigate inconsistencies arising from the involvement of the "flipped" sub-policy (pain-avoidance sub-policy) in the unified behavior policy. We empirically explore the first issue in two discrete Markov Decision Process (MDP) environments, elucidating the crucial advancements of the DMP approach and the necessity for soft treatments on the hard operators. For the second issue, we propose a probabilistic classifier based on the ratio of the pain-seeking sub-policy to the sum of the pain-seeking and goal-reaching sub-policies. This classifier assigns roll-outs to separate replay buffers for updating reward and punishment action-value functions, respectively. Our framework demonstrates superior performance in Turtlebot 3's maze navigation tasks under the ROS Gazebo simulation.
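The classifier described in the abstract can be sketched roughly as follows. The ratio itself comes from the abstract; the routing rule (sampling a buffer with that probability rather than thresholding), the fallback of 0.5 when both probabilities are zero, and all function and buffer names are assumptions made for illustration.

```python
import random


def pain_probability(pi_pain, pi_goal):
    """Ratio of the pain-seeking sub-policy probability to the sum of the
    pain-seeking and goal-reaching sub-policy probabilities for one action."""
    denom = pi_pain + pi_goal
    if denom == 0:
        return 0.5  # assumption: no evidence either way
    return pi_pain / denom


def route_transition(transition, pi_pain, pi_goal,
                     reward_buffer, punish_buffer, rng=random):
    """Append the transition to one of two replay buffers.

    Assumption: the transition goes to the punishment buffer with
    probability equal to the classifier output, otherwise to the
    reward buffer.
    """
    d = pain_probability(pi_pain, pi_goal)
    if rng.random() < d:
        punish_buffer.append(transition)
    else:
        reward_buffer.append(transition)
    return d


if __name__ == "__main__":
    reward_buf, punish_buf = [], []
    rng = random.Random(0)
    # Hypothetical transition tuple: (state, action, reward, next_state).
    route_transition(("s0", "a1", -1.0, "s1"), 0.6, 0.2,
                     reward_buf, punish_buf, rng)
    print(len(reward_buf), len(punish_buf))
```

Sampling keeps both buffers populated even when one sub-policy dominates; a hard threshold on the ratio would be an equally plausible reading of the abstract.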

Paper Structure (13 sections, 15 equations, 7 figures)

Introduction
Related Work
Separating Reward and Punishment
Maximum Entropy Reinforcement Learning
Preliminaries: Soft Q-learning
Proposed Method: softDMP
Generalization of DMP's Objective Functions
Behavior Policy Fusion
Separate Replay Buffer
Experiments
Grid-world
Gazebo Navigation
Conclusions

Figures (7)

Figure 1: Optimal state value $V^*$ learned by the "max" and "min" operators with Q Value Iteration (QVI) in the 9x9 U-maze Grid-world (the black bar signifies the "stop" action)
Figure 2: Two-end optimal policies derived from optimal action values $Q^*$ by "max" and "min" operators with QVI in 9x9 U-maze Grid-world
Figure 3: Smoothed Q-learning curves with "min" operator with flipped policy and "max" operator with optimal policy in terms of step lengths and rewards gained at each episode in 9x9 U-maze Grid-world
Figure 4: 1x21 Chain Environment
Figure 5: Optimal state values learned by SQL with entropy parameter $\eta \in \{-\infty, -1000, -100, -10, -1, -0.1, -0.01, 0, 0.01, 0.1, 1, 10, 100, 1000, \infty\}$ in 1x21 Chain environment
...and 2 more figures
