An Empirical Study of Deep Reinforcement Learning in Continuing Tasks

Yi Wan, Dmytro Korenkevych, Zheqing Zhu

TL;DR

This paper studies deep RL in continuing tasks, where resets may be unavailable or agent-controlled. It provides an extensive empirical study of popular algorithms (DDPG, TD3, SAC, PPO, DQN) on Mujoco- and Atari-based testbeds with different reset regimes, finding that predefined resets greatly aid learning, while testbeds with agent-controlled resets can be harder. A key contribution is the demonstration that temporal-difference-based reward centering substantially improves performance across algorithms and scales to larger tasks, mitigating the effects of reward offsets and some negative effects of high discount factors. The findings offer practical guidance for deploying deep RL in real-world, non-episodic settings and point to reward centering as a robust tool for continuing tasks, while acknowledging limitations and directions for future work.

Abstract

In reinforcement learning (RL), continuing tasks are tasks where the agent-environment interaction is ongoing and cannot be broken down into episodes. These tasks are suitable when environment resets are unavailable, agent-controlled, or predefined but where all rewards, including those beyond resets, are critical. These scenarios frequently occur in real-world applications and cannot be modeled by episodic tasks. While modern deep RL algorithms have been extensively studied and are well understood in episodic tasks, their behavior in continuing tasks remains underexplored. To address this gap, we provide an empirical study of several well-known deep RL algorithms using a suite of continuing task testbeds based on Mujoco and Atari environments, highlighting several key insights concerning continuing tasks. Using these testbeds, we also investigate the effectiveness of a method for improving temporal-difference-based RL algorithms in continuing tasks by centering rewards, as introduced by Naik et al. (2024). While their work primarily focused on this method in conjunction with Q-learning, our results extend their findings by demonstrating that this method is effective across a broader range of algorithms, scales to larger tasks, and outperforms two other reward-centering approaches.
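
To make the idea of temporal-difference-based reward centering concrete, here is a minimal tabular sketch. It follows our understanding of the Q-learning form described by Naik et al. (2024): the TD error is computed on the centered reward, and the same TD error also drives the average-reward estimate. It is not the deep RL implementation evaluated in this paper. The function names, the step sizes `alpha` and `eta`, and the one-state toy task are assumptions chosen purely for illustration.

```python
def centered_td_update(q, avg_reward, s, a, r, s_next,
                       alpha=0.1, eta=0.1, gamma=0.99):
    """One tabular Q-learning step with TD-based reward centering.

    q is a list of per-state lists of action values, modified in place.
    Returns the updated average-reward estimate.
    """
    # TD error on the centered reward (r - avg_reward).
    td_error = (r - avg_reward) + gamma * max(q[s_next]) - q[s][a]
    # Standard Q-learning update driven by the centered TD error.
    q[s][a] += alpha * td_error
    # The same TD error updates the average-reward estimate.
    avg_reward += eta * alpha * td_error
    return avg_reward


def run_constant_reward(eta, steps=5000, reward=5.0):
    """Hypothetical one-state, one-action continuing task with a constant reward.

    Setting eta=0.0 disables centering and recovers plain discounted Q-learning.
    """
    q = [[0.0]]
    avg_reward = 0.0
    for _ in range(steps):
        avg_reward = centered_td_update(q, avg_reward, 0, 0, reward, 0, eta=eta)
    return q[0][0], avg_reward


if __name__ == "__main__":
    q_centered, avg_centered = run_constant_reward(eta=0.1)
    q_plain, _ = run_constant_reward(eta=0.0)
    print("centered value:", q_centered, "average-reward estimate:", avg_centered)
    print("uncentered value:", q_plain)
```

In this toy task the uncentered value grows toward reward / (1 - gamma), which is large when the discount factor is close to 1, while the centered value stays much smaller because the average-reward estimate absorbs most of the constant offset. This is the kind of reward-offset effect the abstract refers to, shown here only for a deliberately trivial case.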

Paper Structure (15 sections, 8 equations, 7 figures, 18 tables)

Figures (7)

  • Figure 1: Learning curves in continuing testbeds without resets (upper row), with predefined resets (middle row), and with agent-controlled resets (lower row) based on the Mujoco environment. Each point in a curve shows the reward rate averaged over the past $10,000$ steps. The shaded area shows one standard error.
  • Figure 2: Evolution of DDPG's visited states in two HumanoidStandup testbeds (upper row) and TD3's visited states in two Swimmer testbeds (lower row). In each pair, one testbed does not involve resets, while the other resets with a probability of $0.001$ per time step. We visualize three key elements of the states visited in the first $1$M steps of one run. For HumanoidStandup, all blue dots concentrate in a small suboptimal region, indicating that the agent fails to explore sufficiently without resets. For Swimmer, the orange circle indicates that the swimmer undulates like a snake to move forward, suggesting that the agent finds a decent policy. Without resets, the agent explores a larger region of the state space but fails to learn a good policy.
  • Figure 3: Learning curves in continuing testbeds with predefined resets based on the Atari environment. Each point shows the reward rate over the past $100$K steps. The shaded area shows one standard error. Overall, DQN performs the best of the three tested algorithms.
  • Figure 4: Learning curves on continuing testbeds without resets based on Mujoco environments. Each point shows the reward rate averaged over the past $10,000$ steps.
  • Figure 5: Learning curves on continuing testbeds with predefined resets based on Mujoco environments. Each point shows the reward rate averaged over the past $10,000$ steps. The shaded area shows one standard error.
  • ...and 2 more figures
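
The captions above report a reward rate averaged over a trailing window of steps, with shading of one standard error. The following is a minimal sketch of how such curve points and error bands could be computed. It assumes one point per non-overlapping window and a standard error taken across independent runs; the captions do not specify either detail, and the function names are hypothetical.

```python
import math


def reward_rate_curve(rewards, window):
    """Average reward over each consecutive, non-overlapping window of steps.

    Assumption: one curve point is emitted every `window` steps; a trailing
    partial window is dropped.
    """
    return [sum(rewards[i - window:i]) / window
            for i in range(window, len(rewards) + 1, window)]


def standard_error(values):
    """Standard error of the mean, using the sample (n - 1) variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)


if __name__ == "__main__":
    # Two hypothetical runs of per-step rewards (made-up numbers).
    run_a = [1.0, 1.0, 3.0, 3.0]
    run_b = [3.0, 3.0, 5.0, 5.0]
    curves = [reward_rate_curve(run, window=2) for run in (run_a, run_b)]
    # Standard error across runs at each curve point.
    print([standard_error(list(point)) for point in zip(*curves)])
```

In a continuing task there is no episode return to report, so a windowed reward rate like this is a natural performance measure; the window length trades off smoothness against responsiveness of the curve.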