Adaptive Policy Synchronization for Scalable Reinforcement Learning
Rodney Lafuente-Mercado
TL;DR
ClusterEnv addresses the rigidity of traditional distributed RL stacks in two ways. It decouples environment execution from learning using the DETACH pattern, and it introduces Adaptive Policy Synchronization (APS), which throttles policy updates based on divergence. The approach preserves a Gymnasium-compatible API and supports both on-policy and off-policy methods; it is demonstrated with PPO on LunarLander-v2 across multi-node hardware. The paper formalizes DETACH, provides a divergence-based APS mechanism, and delivers an open-source implementation with SLURM integration, showing reduced synchronization overhead without sacrificing performance. This work offers a modular, high-throughput path to scalable DRL that integrates with existing training pipelines and infrastructure choices, lowering the barrier to large-scale experimentation.
Abstract
Scaling reinforcement learning (RL) often requires running environments across many machines, but most frameworks tie simulation, training, and infrastructure into rigid systems. We introduce ClusterEnv, a lightweight interface for distributed environment execution that preserves the familiar Gymnasium API. ClusterEnv uses the DETACH pattern, which moves environment reset() and step() operations to remote workers while keeping learning centralized. To reduce policy staleness without heavy communication, we propose Adaptive Policy Synchronization (APS), where workers request updates only when divergence from the central learner grows too large. ClusterEnv supports both on- and off-policy methods, integrates into existing training code with minimal changes, and runs efficiently on clusters. Experiments on discrete control tasks show that APS maintains performance while cutting synchronization overhead. Source code is available at https://github.com/rodlaf/ClusterEnv.
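The APS idea described above can be illustrated with a minimal sketch. It is not the ClusterEnv implementation. It assumes a discrete action space, that a worker can obtain the central learner's action probabilities for a small batch of observations, and that divergence is measured as a mean KL divergence compared against a fixed threshold. The names `kl_divergence`, `needs_sync`, and `threshold` are hypothetical and chosen only for this illustration; the direction of the KL term is also an assumption.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete action distributions.

    eps guards against log(0) when q assigns zero probability to an action.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def needs_sync(
    local_probs: Sequence[Sequence[float]],
    central_probs: Sequence[Sequence[float]],
    threshold: float,
) -> bool:
    """Return True when the worker's policy has drifted too far.

    Assumption: each row holds action probabilities for one observation,
    once from the worker's (possibly stale) policy copy and once from the
    central learner. The worker requests fresh weights only when the mean
    KL(local || central) over the batch exceeds the threshold.
    """
    kls = [kl_divergence(l, c) for l, c in zip(local_probs, central_probs)]
    return sum(kls) / len(kls) > threshold


# Identical policies: no synchronization is requested.
same = [[0.5, 0.5], [0.25, 0.75]]
print(needs_sync(same, same, threshold=0.05))

# The worker's copy has drifted: synchronization is requested.
stale = [[0.9, 0.1], [0.9, 0.1]]
fresh = [[0.5, 0.5], [0.5, 0.5]]
print(needs_sync(stale, fresh, threshold=0.05))
```

In this sketch a larger `threshold` means fewer weight transfers but more tolerated staleness, which is the trade-off APS is meant to manage.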
