To the Max: Reinventing Reward in Reinforcement Learning
Grigorii Veviurko, Wendelin Böhmer, Mathijs de Weerdt
TL;DR
This work tackles reward design in reinforcement learning by proposing max-reward RL, which optimizes the maximum reward achieved in an episode rather than the cumulative discounted return. It builds a theoretically grounded framework using an extended max-reward MDP with an auxiliary variable $y$, establishes a Bellman-like contraction, and derives policy gradient theorems enabling PPO and TD3 to operate in the max-reward setting. The authors demonstrate strong empirical gains on goal-reaching tasks (Maze and Fetch) under sparse or surrogate dense rewards and show robustness to stochastic environments. The approach offers a practical alternative to reward shaping, with potential integration with existing reward-design strategies to improve sample efficiency and performance in real-world tasks.
Abstract
In reinforcement learning (RL), different reward functions can define the same optimal policy but result in drastically different learning performance. With some, the agent gets stuck in suboptimal behavior; with others, it solves the task efficiently. Choosing a good reward function is hence an extremely important yet challenging problem. In this paper, we explore an alternative way of using rewards for learning. We introduce \textit{max-reward RL}, where an agent optimizes the maximum rather than the cumulative reward. Unlike earlier works, our approach works for deterministic and stochastic environments and can be easily combined with state-of-the-art RL algorithms. In the experiments, we study the performance of max-reward RL algorithms in two goal-reaching environments from Gymnasium-Robotics and demonstrate their benefits over standard RL. The code is available at https://github.com/veviurko/To-the-Max.
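To make the difference between the two objectives concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the max-reward return of a finite reward sequence takes the discounted form $\max_t \gamma^t r_t$, which is an assumption not spelled out in the text above; the function names and the example rewards are hypothetical.

```python
from typing import Sequence


def cumulative_return(rewards: Sequence[float], gamma: float) -> float:
    """Standard discounted return: sum over t of gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def max_reward_return(rewards: Sequence[float], gamma: float) -> float:
    """Assumed max-reward return: max over t of gamma^t * r_t."""
    return max(gamma ** t * r for t, r in enumerate(rewards))


def max_reward_return_recursive(rewards: Sequence[float], gamma: float) -> float:
    """Same quantity via the backward recursion G_t = max(r_t, gamma * G_{t+1})."""
    g = rewards[-1]  # at the last step only the final reward remains
    for r in reversed(rewards[:-1]):
        g = max(r, gamma * g)
    return g


# Hypothetical episode: a small shaping reward, then the goal, then a leftover.
rewards = [0.0, 0.2, 1.0, 0.1]
gamma = 0.9

g_sum = cumulative_return(rewards, gamma)
g_max = max_reward_return(rewards, gamma)
g_rec = max_reward_return_recursive(rewards, gamma)
```

Under this assumed form, the cumulative return counts every intermediate reward, while the max-reward return depends only on the single best discounted reward in the episode. The backward recursion replaces the sum in the usual return recursion with a max, which gives some intuition for why a Bellman-like operator can be defined in this setting; the extended MDP with the auxiliary variable $y$ is not reproduced here.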
