Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL
Erfan Miahi, Eugene Belilovsky
TL;DR
This work identifies a striking property of RL fine-tuning: weight updates are extremely sparse (≈99% of parameters unchanged per step) because BF16 precision interacts with small RL learning rates, even though gradients remain dense. It then introduces PULSE, a lossless patch-based synchronization scheme that transmits only the indices and values of changed parameters. Because it sends the new values directly rather than additive deltas, it avoids floating-point drift and preserves bit-identical reconstruction across multi-hop transfers. Empirical results show that PULSE delivers over 100x bandwidth reduction in a decentralized RL network while maintaining identical training dynamics and performance compared to full synchronization. The approach lets decentralized RL approach centralized throughput on commodity networks, significantly reducing the bandwidth bottleneck for large-scale post-training RL pipelines. Practical implications include robust asynchronous operation, integrity verification, and bandwidth-aware compression choices, with clear avenues for extending the method to other RL algorithms and longer-running post-training scenarios.
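The precision argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' experimental setup: the weight scale (`weight_std = 0.02`), the learning rate (`lr = 1e-6`), and the Adam-like update of magnitude `lr` on every parameter are hypothetical values chosen for illustration, and BF16 is emulated by rounding float32 bit patterns because NumPy has no native BF16 type.

```python
import numpy as np

def to_bf16(x):
    """Round float32 values to the nearest BF16 value (ties to even).

    The result is returned as float32 whose low 16 bits are zero.
    NaN and infinity are not handled; this is an illustration only.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Round-to-nearest-even: add 0x7FFF plus the lowest kept bit, then truncate.
    lowest_kept = (bits >> np.uint32(16)) & np.uint32(1)
    rounded = bits + np.uint32(0x7FFF) + lowest_kept
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
n = 1_000_000
weight_std = 0.02  # assumed weight scale (hypothetical)
lr = 1e-6          # assumed RL learning rate (hypothetical)

# Weights as they would be stored in BF16.
weights = to_bf16(rng.normal(0.0, weight_std, n))

# A dense update: every parameter receives a nonzero step of size lr.
update = (lr * rng.choice([-1.0, 1.0], size=n)).astype(np.float32)

# Apply the step, then round back to BF16 storage.
new_weights = to_bf16(weights + update)

sparsity = float(np.mean(new_weights == weights))
print(f"nonzero update entries: {np.count_nonzero(update) / n:.2%}")
print(f"stored weights unchanged after the step: {sparsity:.2%}")
```

Under these assumptions the update is dense, yet most stored weights come back bit-for-bit unchanged, because a step much smaller than half the BF16 spacing at a weight's magnitude is rounded away. Only weights very close to zero, where the spacing is fine enough, actually move.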
Abstract
Reinforcement learning (RL) is a critical component for post-training large language models (LLMs). However, in bandwidth-constrained distributed RL, scalability is often bottlenecked by the synchronization of policy weights from trainers to inference workers, particularly over commodity networks or in decentralized settings. While recent studies suggest that RL updates modify only a small fraction of model parameters, these observations are typically based on coarse checkpoint differences. We present a systematic empirical study of weight-update sparsity at both step-level and multi-step granularities, examining its evolution across training dynamics, off-policy delay, and model scale. We find that update sparsity is consistently high, frequently exceeding 99% across practically relevant settings. Leveraging this structure, we propose PULSE (Patch Updates via Lossless Sparse Encoding), a simple yet highly efficient lossless weight synchronization method that transmits only the indices and values of modified parameters. PULSE is robust to transmission errors and avoids the floating-point drift inherent in additive delta schemes. In bandwidth-constrained decentralized environments, our approach achieves over 100x (14 GB to ~108 MB) communication reduction while maintaining bit-identical training dynamics and performance compared to full weight synchronization. As a result, PULSE enables decentralized RL training to approach centralized throughput, reducing the weight-synchronization bandwidth needed to maintain high GPU utilization from 20 Gbit/s to 0.2 Gbit/s.
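The idea of sending only the indices and values of modified parameters can be sketched as follows. This is a minimal illustration, not the PULSE implementation: the function names `make_patch` and `apply_patch`, the use of `uint16` arrays to stand in for BF16 bit patterns, the `uint32` index width, and the SHA-256 check are all assumptions made for the example, and any further compression of the patch is not shown.

```python
import hashlib
import numpy as np

def make_patch(old_bits, new_bits):
    """Return (indices, values) for every position whose bit pattern changed."""
    idx = np.flatnonzero(old_bits != new_bits).astype(np.uint32)
    return idx, new_bits[idx].copy()

def apply_patch(bits, idx, values):
    """Overwrite the patched positions with the transmitted values."""
    out = bits.copy()
    out[idx] = values  # direct assignment, no arithmetic, so no rounding
    return out

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical flat parameter buffer, stored as raw 16-bit patterns.
old_bits = rng.integers(0, 2**16, size=n, dtype=np.uint16)

# Simulate a training step that changes 1% of the parameters.
changed = rng.choice(n, size=n // 100, replace=False)
new_bits = old_bits.copy()
new_bits[changed] ^= np.uint16(1)  # flipping a bit guarantees a real change

idx, values = make_patch(old_bits, new_bits)

# Two hops: trainer -> relay -> worker, each starting from the old weights.
relay = apply_patch(old_bits, idx, values)
worker = apply_patch(old_bits, idx, values)

# One possible integrity check: compare digests of the full buffers.
same_digest = (
    hashlib.sha256(worker.tobytes()).hexdigest()
    == hashlib.sha256(new_bits.tobytes()).hexdigest()
)

full_bytes = new_bits.nbytes
patch_bytes = idx.nbytes + values.nbytes
print(f"full sync: {full_bytes} bytes, patch: {patch_bytes} bytes")
print(f"bit-identical after relay: {np.array_equal(worker, new_bits)}, digest match: {same_digest}")
```

Because the patch carries the new values themselves, applying it is a plain overwrite and every receiver ends up with exactly the trainer's bits. In this uncompressed form each changed parameter costs 6 bytes (4 for the index, 2 for the value), so the raw saving depends directly on the sparsity level.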
