Periodic Asynchrony: An Effective Method for Accelerating Reinforcement Learning
Jian Lu
TL;DR
The paper tackles inefficient RL training by decoupling inference from training using periodic asynchronous scheduling, which keeps training on-policy while enabling parallelism and scalable resource usage. It also introduces a unified tri-model architecture and a shared-prompt attention mechanism to reduce repeated computation. Through Megatron-style 3D parallelism and optimized dataflow on NPUs, the approach achieves 3x–5x end-to-end throughput gains without loss of accuracy, and demonstrates near-linear scalability as compute increases. The method is designed to be broadly applicable to on-policy RL, offering a practical path to faster RL training for large models such as LLMs.
Abstract
Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we return to the strategy of deploying inference and training separately. By improving the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework that allows demand-driven, independent, and elastic scaling of each component. The accuracy of the algorithm remains fully equivalent to that of the synchronous method, and both are on-policy. In addition, we apply a unified tri-model architecture in the training phase and propose a shared-prompt attention mask to reduce repeated computation. In practice, these techniques achieve at least a threefold overall performance improvement in RL training on NPU platforms, indicating their potential for widespread application.
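To make the idea of periodic asynchrony concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes one inference worker and one trainer connected by a queue: within an iteration the trainer consumes rollouts as soon as they arrive, while weights are updated and synchronized only at the iteration boundary, so every rollout is consumed under the same policy version that generated it. The function name `run_periodic_async`, the semaphore-based synchronization, and the integer "policy version" are all illustrative assumptions.

```python
import queue
import threading


def run_periodic_async(num_iterations, prompts_per_iteration):
    """Toy model of periodic asynchrony (hypothetical, for illustration only).

    Returns a list of (rollout_version, trainer_version) pairs, one per
    consumed rollout.
    """
    rollouts = queue.Queue()                 # hand-off from inference to training
    weights_ready = threading.Semaphore(0)   # released once per weight sync
    log = []

    def inference_worker():
        for version in range(num_iterations):
            if version > 0:
                # Wait until the trainer has published the updated weights.
                weights_ready.acquire()
            for prompt_id in range(prompts_per_iteration):
                # Stand-in for generating a response with policy `version`.
                rollouts.put((version, prompt_id))

    def trainer():
        for version in range(num_iterations):
            for _ in range(prompts_per_iteration):
                # Training overlaps with generation inside the iteration:
                # each rollout is processed as soon as it is available.
                rollout_version, _prompt_id = rollouts.get()
                # Stand-in for forward/backward with gradient accumulation.
                log.append((rollout_version, version))
            # Iteration boundary: stand-in for the optimizer step followed
            # by pushing the new weights to the inference worker.
            weights_ready.release()

    threads = [
        threading.Thread(target=inference_worker),
        threading.Thread(target=trainer),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    pairs = run_periodic_async(num_iterations=3, prompts_per_iteration=4)
    # On-policy check: every rollout was generated by the version that trains on it.
    print(all(r == t for r, t in pairs))
```

In this sketch the overlap happens only inside an iteration; the synchronization point at the boundary is what keeps the toy schedule on-policy, at the cost of a brief stall while weights are exchanged.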
