Analyzing and Bridging the Gap between Maximizing Total Reward and Discounted Reward in Deep Reinforcement Learning

Shuyu Yin, Fei Wen, Peilin Liu, Tao Luo

TL;DR

The paper tackles objective misalignment in reinforcement learning, showing that simply increasing the discount factor $\gamma$ cannot guarantee alignment between maximizing total return and discounted return in environments with cyclic states. It derives a suboptimality bound revealing a non-vanishing term depending on $(1-\gamma)$ and $L_{\max}$, which motivates two alignment strategies: treating the terminal state value $\mathcal{V}_T$ as a tunable hyper-parameter under theoretical conditions, and calibrating trajectory rewards to ensure modified discounted returns monotonically track total returns for off-policy learning. The authors provide sufficient conditions for alignment across positive, negative, and constant reward settings, along with lemmas on terminal accessibility and absorption probabilities, and validate the theory with experiments using DQN, A2C, TD3, and SAC across multiple tasks including long-horizon environments. A practical trajectory reward data calibration method is proposed, demonstrating robust improvements in performance and discount-factor robustness, and suggesting new proxy designs that better reflect the total return.
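The misalignment described above can be seen on two hand-made reward sequences. The following is a minimal sketch, not taken from the paper: the helper names `total_return` and `discounted_return` and both trajectories are hypothetical, and $\gamma = 0.97$ is chosen only because the Figure 2 caption uses it.

```python
def total_return(rewards):
    """Undiscounted sum of rewards along one trajectory."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """Discounted sum: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


gamma = 0.97  # same discount factor as in the Figure 2 caption

# Hypothetical trajectories, chosen only for illustration.
short_traj = [1.0] * 10           # small rewards collected early
long_traj = [0.0] * 100 + [20.0]  # one larger reward after a long delay

for name, traj in [("short", short_traj), ("long", long_traj)]:
    print(name, total_return(traj), round(discounted_return(traj, gamma), 3))
```

Under these assumptions the total return ranks `long_traj` above `short_traj`, while the discounted return ranks them the other way around, so a learner that maximizes the discounted return would prefer the trajectory with the lower total return.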

Abstract

The optimal objective is a fundamental aspect of reinforcement learning (RL), as it determines how policies are evaluated and optimized. While total return maximization is the ideal objective in RL, discounted return maximization is the practical objective due to its stability. This can lead to a misalignment of objectives. To better understand the problem, we theoretically analyze the performance gap between the policy that maximizes the total return and the policy that maximizes the discounted return. Our analysis reveals that increasing the discount factor can be ineffective at eliminating this gap when the environment contains cyclic states, a frequent scenario. To address this issue, we propose two alternative approaches to align the objectives. The first approach achieves alignment by modifying the terminal state value, treating it as a tunable hyper-parameter whose suitable range is defined through theoretical analysis. The second approach focuses on calibrating the reward data in trajectories, enabling alignment in practical Deep RL applications using off-policy algorithms. This method enhances robustness to the discount factor and improves performance when the trajectory length is large. Our proposed methods demonstrate that adjusting reward data can achieve alignment, providing an insight that can be leveraged to design new optimization objectives to fundamentally enhance the performance of RL algorithms.
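To see why the terminal state value can act as a tuning knob, consider a trajectory of length $L$ with a constant per-step reward $r$ that bootstraps from a terminal value $\mathcal{V}_T$, giving the discounted return $r\,\frac{1-\gamma^{L}}{1-\gamma} + \gamma^{L}\mathcal{V}_T$. The sketch below evaluates this quantity for a few lengths. The function name `bootstrapped_return` and the terminal values $\pm 100$ are illustrative assumptions, not the paper's settings, and the paper's sufficient conditions on $\mathcal{V}_T$ are not reproduced here.

```python
def bootstrapped_return(reward, length, gamma, terminal_value):
    """Discounted return of `length` steps of constant `reward`,
    followed by a bootstrap from `terminal_value` at termination."""
    g = terminal_value
    for _ in range(length):
        g = reward + gamma * g
    return g


gamma = 0.97
lengths = [10, 100, 500]

# (per-step reward, terminal state value); the +/-100 values are arbitrary
# illustrative choices, not values used in the paper.
settings = [(+1.0, 0.0), (+1.0, 100.0), (-1.0, 0.0), (-1.0, -100.0)]

for reward, v_t in settings:
    values = [round(bootstrapped_return(reward, n, gamma, v_t), 2) for n in lengths]
    print(f"r={reward:+.0f}, V_T={v_t:+.0f}: {dict(zip(lengths, values))}")
```

With $r = +1$ and $\mathcal{V}_T = 0$, the discounted return grows with the trajectory length, matching the total return, whereas a large positive $\mathcal{V}_T$ makes shorter trajectories look better. With $r = -1$ the picture is mirrored: $\mathcal{V}_T = 0$ favors short trajectories, and a large negative $\mathcal{V}_T$ favors long ones.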

Paper Structure (11 sections, 7 theorems, 32 equations, 7 figures, 3 tables, 3 algorithms)

This paper contains 11 sections, 7 theorems, 32 equations, 7 figures, 3 tables, 3 algorithms.

Key Result

Theorem 1

Consider a stochastic MDP with a discrete action space of size $|A| < \infty$ and a maximum trajectory length $L_{\max}$, satisfying Assumptions \ref{assum::rewardBound} and \ref{assum::concentrability}, with terminal state value $\mathcal{V}_T=0$. Then for any $f \in \mathcal{F}_M$, we have: where $\pi_f$ is the policy derived from $f$.

Figures (7)

  • Figure 1: A simple deterministic MDP with cyclic states.
  • Figure 2: Demonstration of the objective misalignment problem. $G$ represents the discounted return. Each point corresponds to a trajectory collected from the environment, with discounted returns calculated using a discount factor of $0.97$. As shown, for a given discounted return, the total return varies widely.
  • Figure 3: Demonstration of trajectories' original discounted returns and those modified by Algorithm \ref{alg::estimate_hist}. Each point represents a trajectory: blue points show the original total return (x-axis) and original discounted return (y-axis), while orange points represent the modified discounted return (y-axis) with the original total return preserved on the x-axis. The adjustment ensures that the modified discounted returns increase consistently with the original total returns.
  • Figure 4: Validation results for Theorems \ref{thm::condition4Optimal}, \ref{thm::condition4ONonworst}, and \ref{thm::constantRewardOptimality}. Solid lines indicate the mean trajectory length across five independent experiments (with different random seeds), smoothed using a window size of 50, and shaded regions show the standard deviations. The maximum trajectory length is set to 500. In subfigure (1) (positive reward), a non-zero terminal state value yields the optimal policy with maximal trajectory length, substantiating Theorem \ref{thm::condition4Optimal}. Subfigure (2) (negative reward) demonstrates that a non-zero terminal state value generates a non-worst policy with a trajectory length of approximately 100, validating Theorem \ref{thm::condition4ONonworst}. Subfigure (3) ($+1$ reward) reveals that a zero terminal state value raises the trajectory length to 500, signifying the optimal policy, whereas a non-zero value reduces the trajectory length to zero, representing the worst policy, consistent with Theorem \ref{thm::constantRewardOptimality} (1). In subfigure (4) ($-1$ reward), a zero terminal state value reduces the trajectory length below 100, characterizing the optimal policy, while a non-zero value keeps it around 500, reflecting the worst policy, thus corroborating Theorem \ref{thm::constantRewardOptimality} (2).
  • Figure 5: Comparison of performance under different discount factors and their combination with our method. Solid lines show the mean performance over five independent experiments (with different random seeds), smoothed using a window size of 50, while shaded regions represent the corresponding standard deviations. Our method significantly improves performance in all experiments with $\gamma = 0.97$, particularly in the LunarLander and LunarLanderContinuous environments, where results are comparable to those with $\gamma = 0.99$. Furthermore, in the Hopper environment, our method achieves additional performance gains with $\gamma = 0.99$.
  • ...and 2 more figures

Theorems & Definitions (18)

  • Example 1: MDP with cyclic states
  • Remark 1
  • Theorem 1: suboptimality
  • proof
  • Lemma 1: terminal state values reverse relation
  • Lemma 2: maximum access distance
  • proof
  • Lemma 3: monotonicity of absorption probability
  • proof
  • Remark 2
  • ...and 8 more