An Empirical Study of Deep Reinforcement Learning in Continuing Tasks
Yi Wan, Dmytro Korenkevych, Zheqing Zhu
TL;DR
This paper tackles the challenge of applying deep RL to continuing tasks, where resets may be unavailable or agent-controlled. It presents an extensive empirical study of popular algorithms (DDPG, TD3, SAC, PPO, DQN) across diverse MuJoCo and Atari testbeds with varying reset regimes, finding that predefined resets greatly aid learning while agent-controlled resets can make learning harder. A key contribution is the demonstration that temporal-difference-based reward centering substantially improves performance across algorithms and scales to larger tasks, mitigating the effects of reward offsets and some of the negative effects of high discount factors. The findings offer practical guidance for deploying deep RL in real-world, non-episodic settings and point to reward centering as a robust tool for continuing tasks, while acknowledging limitations and directions for future work.
Abstract
In reinforcement learning (RL), continuing tasks are tasks where the agent-environment interaction is ongoing and cannot be broken down into episodes. These tasks are suitable when environment resets are unavailable, agent-controlled, or predefined but where all rewards, including those beyond resets, are critical. These scenarios frequently occur in real-world applications and cannot be modeled by episodic tasks. While modern deep RL algorithms have been extensively studied and are well understood in episodic tasks, their behavior in continuing tasks remains underexplored. To address this gap, we provide an empirical study of several well-known deep RL algorithms using a suite of continuing task testbeds based on MuJoCo and Atari environments, highlighting several key insights concerning continuing tasks. Using these testbeds, we also investigate the effectiveness of a method for improving temporal-difference-based RL algorithms in continuing tasks by centering rewards, as introduced by Naik et al. (2024). While their work primarily focused on this method in conjunction with Q-learning, our results extend their findings by demonstrating that this method is effective across a broader range of algorithms, scales to larger tasks, and outperforms two other reward-centering approaches.
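
To make the idea of temporal-difference-based reward centering concrete, the following is a minimal tabular sketch, not the implementation studied in this paper (which applies centering to deep RL algorithms). It assumes the commonly described form of the method for Q-learning: the TD error is computed on the reward minus a running average-reward estimate, and that same TD error also updates the estimate. The function name `centered_q_update`, the step sizes `alpha` and `eta`, and the toy two-state environment are all hypothetical choices made for illustration.

```python
import numpy as np


def centered_q_update(q, r_bar, s, a, r, s_next, alpha=0.1, eta=0.1, gamma=0.99):
    """One tabular Q-learning step with TD-based reward centering (illustrative).

    q      : 2-D array of action values, indexed as q[state, action]
    r_bar  : current estimate of the average reward
    Returns the updated (q, r_bar) and the TD error used for both updates.
    """
    # TD error on the centered reward (r - r_bar) instead of the raw reward.
    delta = (r - r_bar) + gamma * np.max(q[s_next]) - q[s, a]
    # Standard Q-learning update, driven by the centered TD error.
    q[s, a] += alpha * delta
    # The same TD error drives the average-reward estimate (assumed step size eta * alpha).
    r_bar += eta * alpha * delta
    return q, r_bar, delta


if __name__ == "__main__":
    # Hypothetical continuing task: two states, two actions, no resets,
    # and every reward shifted by a large constant offset.
    rng = np.random.default_rng(0)
    offset = 100.0
    q = np.zeros((2, 2))
    r_bar = 0.0
    s = 0
    for _ in range(5000):
        # Epsilon-greedy behavior policy.
        a = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(q[s]))
        s_next = a                      # the action chooses the next state
        r = offset + float(s_next)      # state 1 pays slightly more than state 0
        q, r_bar, _ = centered_q_update(q, r_bar, s, a, r, s_next)
        s = s_next
    print("average-reward estimate:", r_bar)
    print("action values:", q)
```

In this sketch the constant offset is absorbed by `r_bar` rather than by the action values, which is the intuition behind the claim that centering mitigates reward offsets; the effect with function approximation is what the paper studies empirically.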
