Adaptive Policy Synchronization for Scalable Reinforcement Learning

Rodney Lafuente-Mercado

TL;DR

ClusterEnv tackles the rigidity of traditional distributed RL stacks by decoupling environment execution from learning using the DETACH pattern and by introducing Adaptive Policy Synchronization (APS), which throttles policy updates based on divergence. The approach preserves a Gymnasium-compatible API and supports both on-policy and off-policy methods; it is demonstrated with PPO on LunarLander-v2 across multi-node hardware. The paper formalizes DETACH, provides a divergence-based APS mechanism, and delivers an open-source implementation with SLURM integration, showing reduced synchronization overhead without sacrificing performance. This work offers a modular, high-throughput pathway for scalable DRL that can integrate with existing training pipelines and infrastructure choices, lowering the barrier to large-scale experimentation.

Abstract

Scaling reinforcement learning (RL) often requires running environments across many machines, but most frameworks tie simulation, training, and infrastructure into rigid systems. We introduce ClusterEnv, a lightweight interface for distributed environment execution that preserves the familiar Gymnasium API. ClusterEnv uses the DETACH pattern, which moves environment reset() and step() operations to remote workers while keeping learning centralized. To reduce policy staleness without heavy communication, we propose Adaptive Policy Synchronization (APS), where workers request updates only when divergence from the central learner grows too large. ClusterEnv supports both on- and off-policy methods, integrates into existing training code with minimal changes, and runs efficiently on clusters. Experiments on discrete control tasks show that APS maintains performance while cutting synchronization overhead. Source code is available at https://github.com/rodlaf/ClusterEnv.
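
The APS rule described above (a worker pulls new weights only when its policy has drifted too far from the central learner) can be illustrated with a small sketch. This is not the ClusterEnv implementation: the function names `categorical_kl` and `should_sync` are hypothetical, the divergence is assumed to be a mean KL divergence between discrete action distributions over a batch of states, and the sketch simply takes both sets of action probabilities as inputs without modeling how a worker would obtain the learner's. The default threshold of 0.05 is borrowed from the figure captions.

```python
import math


def categorical_kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete action distributions given as probability lists.

    eps guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def should_sync(worker_probs, learner_probs, delta=0.05):
    """Return True when the mean KL over a batch of states exceeds delta.

    worker_probs / learner_probs: lists of per-state action distributions
    (hypothetical inputs; one distribution per sampled state).
    """
    kls = [categorical_kl(w, l) for w, l in zip(worker_probs, learner_probs)]
    return sum(kls) / len(kls) > delta


# A worker whose policy is still close to the learner's keeps stepping locally.
close = should_sync([[0.5, 0.5]], [[0.55, 0.45]])

# A worker whose policy has drifted requests fresh weights.
drifted = should_sync([[0.9, 0.1]], [[0.5, 0.5]])
```

Under these assumptions, a smaller `delta` triggers synchronization more often and a larger one trades some staleness for fewer weight pulls, which is the trade-off the threshold controls.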

Paper Structure

This paper contains 19 sections, 4 equations, 3 figures, 2 algorithms.

Figures (3)

  • Figure 1: The DETACH architecture. Environment simulation is offloaded to distributed workers, while learning remains centralized at the head node. This separation ensures modularity and scalability without imposing rigid synchronization protocols.
  • Figure 2: Learning curves on LunarLander-v2. APS with intermediate KL thresholds (e.g., $\delta=0.05$) achieves strong performance with fewer synchronizations compared to lower thresholds.
  • Figure 3: Cumulative synchronization count per worker. Lower KL thresholds result in more frequent weight pulls, while higher thresholds yield computational savings.