To the Max: Reinventing Reward in Reinforcement Learning

Grigorii Veviurko; Wendelin Böhmer; Mathijs de Weerdt

TL;DR

This work tackles reward design in reinforcement learning by proposing max-reward RL, which optimizes the maximum reward achieved in an episode rather than the cumulative discounted return. It builds a theoretically grounded framework using an extended max-reward MDP with an auxiliary variable $y$, establishes a Bellman-like contraction, and derives policy gradient theorems enabling PPO and TD3 to operate in the max-reward setting. The authors demonstrate strong empirical gains on goal-reaching tasks (Maze and Fetch) under sparse or surrogate dense rewards and show robustness to stochastic environments. The approach offers a practical alternative to reward shaping, with potential integration with existing reward-design strategies to improve sample efficiency and performance in real-world tasks.

Abstract

In reinforcement learning (RL), different reward functions can define the same optimal policy but result in drastically different learning performance. For some, the agent gets stuck with a suboptimal behavior, and for others, it solves the task efficiently. Choosing a good reward function is hence an extremely important yet challenging problem. In this paper, we explore an alternative approach for using rewards for learning. We introduce max-reward RL, where an agent optimizes the maximum rather than the cumulative reward. Unlike earlier works, our approach works for deterministic and stochastic environments and can be easily combined with state-of-the-art RL algorithms. In the experiments, we study the performance of max-reward RL algorithms in two goal-reaching environments from Gymnasium-Robotics and demonstrate its benefits over standard RL. The code is available at https://github.com/veviurko/To-the-Max.
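The difference between the two objectives can be seen on a toy reward sequence. The sketch below is illustrative only and is not taken from the paper's code: the function names, the two reward sequences, and the choice $\gamma = 1$ are assumptions, and the max-reward return is assumed to take the form $\max_t \gamma^t r_t$.

```python
from typing import Sequence


def cumulative_return(rewards: Sequence[float], gamma: float) -> float:
    """Standard discounted return: sum over t of gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def max_reward_return(rewards: Sequence[float], gamma: float) -> float:
    """Assumed max-reward return: max over t of gamma^t * r_t."""
    return max(gamma ** t * r for t, r in enumerate(rewards))


# Hypothetical episodes: one loiters near an intermediate reward,
# the other collects nothing until it reaches the goal.
loiter = [0.5, 0.5, 0.5, 0.5]
reach_goal = [0.0, 0.0, 0.0, 1.0]
gamma = 1.0  # assumed, to keep the arithmetic easy to follow

cum_loiter = cumulative_return(loiter, gamma)
cum_goal = cumulative_return(reach_goal, gamma)
max_loiter = max_reward_return(loiter, gamma)
max_goal = max_reward_return(reach_goal, gamma)

print("cumulative:", cum_loiter, cum_goal)
print("max-reward:", max_loiter, max_goal)
```

In this toy case the cumulative objective ranks the loitering episode higher, while the max-reward objective ranks the goal-reaching episode higher. Whether this happens in a real task depends on the reward function and the discount factor.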

Paper Structure (16 sections, 5 theorems, 46 equations, 6 figures, 2 tables, 1 algorithm)

Introduction
Related work
Background
Deterministic max-reward RL
Chain environment example.
Max-reward RL
Max-reward objective
Policy gradient theorems
Experiments
Maze with shortest path rewards
Stochastic Maze.
Fetch environment
Conclusions and future work
Proofs
Experimental details
...and 1 more sections

Key Result

Lemma 4.2

Let $y\in{\mathbb R}$ and let $y':=\frac{R(s,a,s_{t+1})\lor y}{\gamma}$. Then, the max-reward value functions are subject to the following Bellman-like equations:
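The role of the rescaled variable $y'$ can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes that the max-reward return with auxiliary variable $y$ has the form $y \lor \max_t \gamma^t r_t$ and that $\gamma > 0$. Under that assumption, applying $y' = (r \lor y)/\gamma$ at every step and multiplying by $\gamma$ once per step recovers the same quantity, because $\gamma \cdot \big((r \lor y)/\gamma\big) = r \lor y$. The function names and the reward sequence are hypothetical.

```python
import math
from typing import Sequence


def direct_max_reward(rewards: Sequence[float], gamma: float, y: float) -> float:
    """Assumed definition: y OR max over t of gamma^t * r_t (OR means max)."""
    return max(y, max(gamma ** t * r for t, r in enumerate(rewards)))


def recursive_max_reward(rewards: Sequence[float], gamma: float, y: float) -> float:
    """Unroll the update y' = max(r, y) / gamma along one trajectory.

    Each step contributes one factor of gamma, mirroring a recursion of the
    form v(s, y) = gamma * v(s', y'); at the end of the episode the value is
    taken to be the final y itself.
    """
    scale = 1.0
    for r in rewards:
        y = max(r, y) / gamma  # auxiliary-variable update from the lemma
        scale *= gamma         # accumulated discount from the recursion
    return scale * y


rewards = [0.0, 0.5, 0.0, 1.0]  # hypothetical reward sequence
gamma = 0.5

for y0 in (float("-inf"), 0.1, 0.3):
    d = direct_max_reward(rewards, gamma, y0)
    r = recursive_max_reward(rewards, gamma, y0)
    print(y0, d, r)
```

The division by $\gamma$ in $y'$ is what keeps the stored maximum on the same scale as future discounted rewards, so the recursive and direct computations agree on each trajectory.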

Figures (6)

Figure 1: Five-state chain MDP with three actions (left, stay, right) available in each state and the training results for cumulative (in green) and max-reward (in violet) value iteration. The $y$-axis is the number of training epochs to recover the optimal policy; the $x$-axis shows the values of the intermediate reward $x$. Four panels correspond to different probabilities of skipping transitions into $s_4$ during training.
Figure 2: A three-state MDP with deterministic transitions and stochastic rewards. Two different policies, $\pi_1$ and $\pi_2$, share the same first action $a_1$, but then have different $a_2$, thereby resulting in different reward distributions.
Figure 3: Left: Single-goal maze, where the goal (red ball) is always in the same location. Right: Two-goals maze with two spawn locations of the goal (red balls).
Figure 4: Learning curves of TD3, max-reward TD3, PPO, and max-reward PPO on two different mazes. The vertical axis is the success ratio, i.e., whether the goal was reached during the episode. The shaded area is the standard error of the mean. The horizontal axis is the total environmental timesteps in millions. For each maze, we present results for six different reward functions (columns).
Figure 5: Learning curves of TD3, max-reward TD3, deterministic max-reward TD3, PPO, and max-reward PPO on a stochastic version of the single-goal maze with DSP reward, $k=3$. The vertical axis is the success ratio, the shaded area is the standard error of the mean. The horizontal axis is the total environmental timesteps. The results confirm that our max-reward methods work in stochastic environments.
...and 1 more figures

Theorems & Definitions (14)

Definition 4.1
Lemma 4.2
Definition 4.3
Theorem 4.4
Definition 4.5
Definition 4.6
Definition 4.7
Theorem 4.8
Theorem 4.9
Corollary 4.10
...and 4 more
