Analyzing and Bridging the Gap between Maximizing Total Reward and Discounted Reward in Deep Reinforcement Learning
Shuyu Yin, Fei Wen, Peilin Liu, Tao Luo
TL;DR
The paper tackles objective misalignment in reinforcement learning, showing that simply increasing the discount factor $\gamma$ cannot guarantee alignment between maximizing total return and maximizing discounted return in environments with cyclic states. It derives a suboptimality bound containing a non-vanishing term that depends on $(1-\gamma)$ and $L_{\max}$, which motivates two alignment strategies: treating the terminal state value $\mathcal{V}_T$ as a tunable hyper-parameter under theoretical conditions, and calibrating trajectory rewards so that the modified discounted returns monotonically track total returns for off-policy learning. The authors provide sufficient conditions for alignment across positive, negative, and constant reward settings, along with lemmas on terminal accessibility and absorption probabilities, and validate the theory with experiments using DQN, A2C, TD3, and SAC across multiple tasks, including long-horizon environments. They also propose a practical method for calibrating trajectory reward data, which improves both performance and robustness to the discount factor, and which suggests new proxy designs that better reflect the total return.
Abstract
The optimization objective is a fundamental aspect of reinforcement learning (RL), as it determines how policies are evaluated and optimized. While total return maximization is the ideal objective in RL, discounted return maximization is the practical objective due to its stability. This can lead to a misalignment of objectives. To better understand the problem, we theoretically analyze the performance gap between the policy that maximizes the total return and the policy that maximizes the discounted return. Our analysis reveals that increasing the discount factor can be ineffective at eliminating this gap when the environment contains cyclic states, a frequent scenario. To address this issue, we propose two alternative approaches to align the objectives. The first approach achieves alignment by modifying the terminal state value, treating it as a tunable hyper-parameter whose suitable range is defined through theoretical analysis. The second approach focuses on calibrating the reward data in trajectories, enabling alignment in practical Deep RL applications that use off-policy algorithms. This method enhances robustness to the discount factor and improves performance when the trajectory length is large. Our proposed methods demonstrate that adjusting reward data can achieve alignment, providing an insight that can be leveraged to design new optimization objectives that fundamentally enhance the performance of RL algorithms.
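The misalignment described above can be illustrated with a toy comparison of two trajectories. This is a minimal sketch and not the paper's construction: the reward values, the `delay` parameter (a loose stand-in for extra steps spent revisiting states before reaching the goal), and all helper names are assumptions chosen for illustration.

```python
def total_return(rewards):
    """Undiscounted sum of rewards along a trajectory."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t along a trajectory, starting at t = 0."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


# Hypothetical trajectory A: a small reward collected immediately.
EARLY = [5.0]


def late(delay):
    """Hypothetical trajectory B: a larger reward after `delay` zero-reward steps."""
    return [0.0] * delay + [10.0]


def discounting_agrees_with_total(gamma, delay):
    """True if the discounted objective ranks B above A, as the total return does."""
    return discounted_return(late(delay), gamma) > discounted_return(EARLY, gamma)


if __name__ == "__main__":
    for gamma, delay in [(0.9, 50), (0.99, 50), (0.99, 500)]:
        print(gamma, delay, discounting_agrees_with_total(gamma, delay))
```

In this sketch, trajectory B always has the higher total return, yet for any fixed $\gamma < 1$ a sufficiently long delay makes the discounted objective prefer trajectory A. Raising $\gamma$ restores the correct ranking only up to some trajectory length, which is consistent in spirit with the abstract's point that a larger discount factor alone may not close the gap.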
