Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

Erfan Miahi; Eugene Belilovsky

TL;DR

This work identifies a striking property of RL fine-tuning: weight updates are extremely sparse (≈$99\%$ of parameters unchanged per step) due to the interaction of BF16 precision with small RL learning rates, while gradients remain dense. It then introduces PULSE, a lossless patch-based synchronization scheme that transmits only changed parameter indices and values, avoiding floating-point drift by using direct value patches and preserving bit-identical reconstruction across multi-hop transfers. Empirical results show PULSE delivers over $100\times$ bandwidth reduction in a decentralized RL network while maintaining identical training dynamics and performance compared to full synchronization. The approach enables decentralized RL to approach centralized throughput on commodity networks, significantly reducing the bandwidth bottleneck for large-scale post-training RL pipelines. Practical implications include robust asynchronous operation, integrity verification, and bandwidth-aware compression choices, with clear avenues for extending to other RL algorithms and longer-running post-training scenarios.
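The patch idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names (`make_patch`, `apply_patch`, `update_sparsity`) are hypothetical, weights are modeled as a flat list of raw 16-bit BF16 bit patterns, and no compression or integrity check is included. Comparing bit patterns rather than float values is what keeps the scheme lossless, and overwriting values (instead of adding deltas) means no rounding can accumulate across hops.

```python
from typing import List, Sequence, Tuple

# A patch is a list of (flat parameter index, new 16-bit pattern) pairs.
Patch = List[Tuple[int, int]]


def make_patch(old_bits: Sequence[int], new_bits: Sequence[int]) -> Patch:
    """Collect the positions whose bit pattern changed, with their new values."""
    if len(old_bits) != len(new_bits):
        raise ValueError("weight buffers must have the same length")
    return [(i, n) for i, (o, n) in enumerate(zip(old_bits, new_bits)) if o != n]


def apply_patch(bits: Sequence[int], patch: Patch) -> List[int]:
    """Overwrite the patched positions; no arithmetic is performed on the values."""
    out = list(bits)
    for index, value in patch:
        out[index] = value
    return out


def update_sparsity(patch: Patch, num_params: int) -> float:
    """Fraction of parameters left bit-identical by the update."""
    return 1.0 - len(patch) / num_params


# Toy example: BF16 patterns for 0.5, -1.25, 2.0, 1.0; one weight changes.
old = [0x3F00, 0xBFA0, 0x4000, 0x3F80]
new = [0x3F00, 0xBFA1, 0x4000, 0x3F80]

patch = make_patch(old, new)
relay = apply_patch(old, patch)    # first hop
worker = apply_patch(old, patch)   # second hop, same patch
```

Both `relay` and `worker` end up equal to `new` element for element, which is the bit-identical reconstruction property in miniature.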

Abstract

Reinforcement learning (RL) is a critical component for post-training large language models (LLMs). However, in bandwidth-constrained distributed RL, scalability is often bottlenecked by the synchronization of policy weights from trainers to inference workers, particularly over commodity networks or in decentralized settings. While recent studies suggest that RL updates modify only a small fraction of model parameters, these observations are typically based on coarse checkpoint differences. We present a systematic empirical study of weight-update sparsity at both step-level and multi-step granularities, examining its evolution across training dynamics, off-policy delay, and model scale. We find that update sparsity is consistently high, frequently exceeding 99% across practically relevant settings. Leveraging this structure, we propose PULSE (Patch Updates via Lossless Sparse Encoding), a simple yet highly efficient lossless weight synchronization method that transmits only the indices and values of modified parameters. PULSE is robust to transmission errors and avoids floating-point drift inherent in additive delta schemes. In bandwidth-constrained decentralized environments, our approach achieves over 100x (14 GB to ~108 MB) communication reduction while maintaining bit-identical training dynamics and performance compared to full weight synchronization. By exploiting this structure, PULSE enables decentralized RL training to approach centralized throughput, reducing the bandwidth required for weight synchronization from 20 Gbit/s to 0.2 Gbit/s to maintain high GPU utilization.
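The 20 Gbit/s and 0.2 Gbit/s figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a simple model that is not stated in the source: the weight transfer is not overlapped with compute, so utilization is compute time divided by compute plus transfer time, and 1 GB is taken as 10^9 bytes. The 50 s/step compute time, the 14 GB full payload, and the 140 MB sparse payload are taken from the Figure 1 caption; the function names are hypothetical.

```python
def transfer_seconds(payload_bytes: float, link_gbit_per_s: float) -> float:
    """Time to push a payload through a link, ignoring latency and protocol overhead."""
    return payload_bytes * 8 / (link_gbit_per_s * 1e9)


def utilization(compute_s: float, payload_bytes: float, link_gbit_per_s: float) -> float:
    """Assumed model: GPUs idle for the whole transfer, then compute for compute_s."""
    return compute_s / (compute_s + transfer_seconds(payload_bytes, link_gbit_per_s))


COMPUTE_S = 50.0          # seconds of compute per step (Figure 1 caption)
FULL_BYTES = 14e9         # full 7B BF16 checkpoint, 14 GB
SPARSE_BYTES = 140e6      # 1% of the parameters, 140 MB

full_util = utilization(COMPUTE_S, FULL_BYTES, 20.0)
sparse_util = utilization(COMPUTE_S, SPARSE_BYTES, 0.2)

print(f"full sync   @ 20 Gbit/s : {full_util:.1%}")
print(f"sparse sync @ 0.2 Gbit/s: {sparse_util:.1%}")
```

Under this model both configurations spend about 5.6 s per step on transfer, which puts utilization at roughly 90% in each case. The smaller measured payload (~108 MB) would leave additional headroom.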

Paper Structure (127 sections, 3 theorems, 27 equations, 14 figures, 9 tables, 4 algorithms)

Introduction
Background and Problem Formulation
RL for Reasoning with GRPO
Off-Policy Considerations in Distributed Training
Characterizing Weight Update Sparsity
Experimental Setup
Models.
Training algorithm.
Dataset.
Reward.
Training duration.
Sparsity metric.
How Sparse Are Updates Throughout Training?
Scale and architecture.
Stability throughout training.
...and 112 more sections

Key Result

Theorem A.4

For the Adam optimizer with parameters $(\beta_1, \beta_2)$ where $\beta_2 > \beta_1$, the update magnitude at step $t$ satisfies: […] As $t \to \infty$, this simplifies to: […]

Figures (14)

Figure 1: Compute utilization vs. network bandwidth for a 7B model with 50s/step compute time. Full weight synchronization (14 GB) requires 20 Gbit/s links for 90% GPU utilization. PULSE reduces this to 0.2 Gbit/s (a 100$\times$ reduction) by transmitting only the 1% of parameters that change (140 MB). This enables efficient training over standard network connections, where the global median fixed broadband speed is low. We validate this in a live decentralized network in the experiments section.
Figure 2: Weight update sparsity across model scales and families. (a) Mean per-step sparsity (%) averaged over 400 training steps. Error bars indicate $\pm$1 standard deviation across steps. (b) Sparsity when comparing $\theta_t$ to $\theta_{t+k}$ for increasing $k$. Shaded regions indicate $\pm$1 standard deviation. Within the recommended $k \leq 8$ range for asynchronous RL [scalerl], sparsity remains above 98% for all models.
Figure 3: Why most weights cannot be updated in BF16. The diagonal line shows the minimum update size needed to change a weight (larger weights require larger updates). Horizontal lines show Adam update bounds at learning rate $3 \times 10^{-6}$: the effective bound ($\eta$) and the absorption bound ($10\eta$). The shaded region marks weights beyond the absorption bound, which are permanently frozen. Gray dots show that most LLM weights fall in this region, explaining the observed sparsity.
Figure 4: Factors affecting weight update sparsity. (a) Impact of learning rate: each line shows $k$-step sparsity as a function of learning rate. Higher learning rates reduce sparsity by increasing update magnitudes above the BF16 absorption threshold. (b) Impact of policy staleness: $k$-step sparsity as a function of gradient steps between weight synchronizations. Higher staleness modestly degrades sparsity but the effect is small. Shaded regions indicate $\pm 1$ standard deviation across training steps.
Figure 5: PULSE synchronization topology. Training nodes use high-bandwidth interconnects for dense gradient communication. Sparse weight patches are published to a central relay, enabling inference nodes to synchronize over commodity networks.
...and 9 more figures
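The absorption effect described in the Figure 3 caption can be reproduced with a small BF16 emulation. This is an illustrative sketch under stated assumptions: BF16 is emulated by rounding a float32 bit pattern to its top 16 bits with round-to-nearest-even, inputs are finite, the update magnitude is taken to be exactly $\eta = 3 \times 10^{-6}$ or $10\eta$ as in the caption, and all function names are hypothetical. The subtraction is done in Python's double precision before rounding, which is a simplification of what a real BF16 optimizer step does.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Round a finite float to BF16 (nearest-even) and return its 16-bit pattern."""
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    rounding = 0x7FFF + ((u32 >> 16) & 1)  # bias that yields ties-to-even
    return ((u32 + rounding) >> 16) & 0xFFFF


def from_bf16_bits(bits: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))
    return value


def step_is_absorbed(weight: float, update: float) -> bool:
    """True if storing (weight - update) in BF16 leaves the stored bits unchanged."""
    stored = to_bf16_bits(weight)
    return to_bf16_bits(from_bf16_bits(stored) - update) == stored


ETA = 3e-6  # learning rate quoted in the Figure 3 caption

for w in (0.05, 1e-4):
    print(f"|w| = {w:g}: absorbed at eta = {step_is_absorbed(w, ETA)}, "
          f"at 10*eta = {step_is_absorbed(w, 10 * ETA)}")
```

A weight of 0.05 has a BF16 spacing of about $2.4 \times 10^{-4}$, so an update of either size rounds back to the same value, while a weight of $10^{-4}$ has a spacing of about $4.8 \times 10^{-7}$ and does change.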

Theorems & Definitions (8)

Definition 2.1: Off-Policy Delay
Definition A.1: Weight Update
Definition A.2: Update Sparsity
Definition A.3: Update Absorption
Theorem A.4: Adam Update Upper Bound
proof
Corollary A.5: Weight Magnitude Threshold for BF16
Proposition A.6: Lossless Reconstruction
