Reinforcement Learning via Conservative Agent for Environments with Random Delays
Jongsoo Lee, Jangwon Kim, Jiseok Jeong, Soohee Han
TL;DR
This work addresses reinforcement learning under random delays, which can break the Markov property, by introducing a conservative agent that constructs a constant-delay surrogate using the maximum delay $\Delta_{\max}$. This enables any constant-delay RL method to be applied directly to random-delay environments without modeling the delay distribution, and it yields performance that does not depend on the delay distribution as long as $\Delta_{\max}$ remains fixed. The authors provide theoretical bounds on the performance gap between policies in random-delay and constant-delay settings, and show that under bounded delays the conservative approach performs nearly identically to optimal constant-delay policies. Empirically, conservative-BPQL and conservative-VDPO outperform state-of-the-art baselines on MuJoCo tasks with $\Delta_{\max} \in \{5,10,20\}$ in both asymptotic performance and sample efficiency. Overall, the paper offers a practical, distribution-agnostic framework that unifies constant-delay and random-delay RL and applies broadly to real-world delayed-feedback scenarios.
Abstract
Real-world reinforcement learning applications are often hindered by delayed feedback from environments, which violates the Markov assumption and introduces significant challenges. Although numerous delay-compensating methods have been proposed for environments with constant delays, environments with random delays remain largely unexplored due to their inherent variability and unpredictability. In this study, we propose a simple yet robust agent for decision-making under random delays, termed the conservative agent, which reformulates the random-delay environment into its constant-delay equivalent. This transformation enables any state-of-the-art constant-delay method to be directly extended to random-delay environments without modifying the algorithmic structure or sacrificing performance. We evaluate the conservative agent-based algorithm on continuous control tasks, and empirical results demonstrate that it significantly outperforms existing baseline algorithms in terms of asymptotic performance and sample efficiency.
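The reformulation described above can be illustrated with a minimal sketch. The text here does not spell out the mechanism, so the following is one plausible reading and not the authors' implementation: each observation is held back until it is exactly $\Delta_{\max}$ steps old, so a downstream constant-delay method always sees a fixed delay. The class `ConservativeBuffer` and its methods `push` and `release` are hypothetical names, and the sketch assumes that every observation carries the time step at which it was generated and that no delay exceeds $\Delta_{\max}$.

```python
from typing import Any, Dict, Optional


class ConservativeBuffer:
    """Hypothetical helper: releases each observation exactly
    `max_delay` steps after it was generated, regardless of when
    it actually arrived."""

    def __init__(self, max_delay: int) -> None:
        if max_delay < 0:
            raise ValueError("max_delay must be non-negative")
        self.max_delay = max_delay
        # Observations that have arrived but are not yet released,
        # keyed by the time step at which they were generated.
        self._pending: Dict[int, Any] = {}

    def push(self, generated_at: int, observation: Any) -> None:
        """Store an observation that has just arrived."""
        self._pending[generated_at] = observation

    def release(self, now: int) -> Optional[Any]:
        """Return the observation generated at `now - max_delay`,
        or None during the first `max_delay` steps."""
        target = now - self.max_delay
        if target < 0:
            return None
        if target not in self._pending:
            # Assumption violated: the true delay exceeded max_delay.
            raise RuntimeError(
                f"observation from step {target} has not arrived"
            )
        return self._pending.pop(target)
```

In use, the agent would call `push` for every observation arriving at the current step and then call `release` once; the released observation would then be passed to any constant-delay algorithm configured with delay $\Delta_{\max}$.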
