RL over Commodity Networks: Overcoming the Bandwidth Barrier with Lossless Sparse Deltas

Chaoyi Ruan; Geng Luo; Xinyi Wan; Long Zhao; Qinghe Wang; Jiaan Zhu; Duling Xu; Guanbin Xu; Dehui Wei; Xiang Liu; Cheng Li; Haifeng Sun; Congcong Miao; Jialin Li

TL;DR

This paper tackles the bandwidth bottleneck of RL post-training over commodity networks by exploiting the fine-grained sparsity observed in RL updates: typically only about $\rho\approx 1\%$ of parameters change per step. It introduces SparrowRL, which unifies storage and transfer through lossless sparse delta checkpoints and uses streaming, relay-based fanout, and heterogeneity-aware scheduling to hide communication latency within the one-step lag bound. The approach reduces per-step transfer by up to $79\times$ and improves throughput by $2.4$–$9.5\times$ relative to full-weight broadcasts, bringing throughput to within $8.91\%$ of an ideal single-datacenter RDMA baseline. Running on on-demand, cross-cloud GPUs, it delivers $1.21\times$–$1.59\times$ higher tokens per dollar than reserved RDMA clusters. Overall, SparrowRL enables practical, cost-effective RL post-training on commodity networks, broadening access to large-scale LLM fine-tuning across diverse providers and regions.
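
To give a rough sense of what a $79\times$ smaller payload means on a slow link, the back-of-the-envelope sketch below estimates per-step transfer time. The model size (8B parameters), the 2 bytes per parameter, and the 1 Gbps link are illustrative assumptions, not figures taken from this summary; only the $79\times$ reduction factor comes from the text above.

```python
def transfer_seconds(payload_bytes: float, link_gbps: float) -> float:
    """Idealized transfer time: payload size divided by link bandwidth.

    Ignores protocol overhead, latency, and congestion.
    """
    return payload_bytes * 8 / (link_gbps * 1e9)


# Illustrative assumptions (not from the summary above).
PARAMS = 8e9            # assumed model size: 8B parameters
BYTES_PER_PARAM = 2     # assumed 16-bit weights
LINK_GBPS = 1.0         # assumed commodity cross-cloud link

# Reported in the summary: up to 79x smaller per-step payload.
REDUCTION = 79

full_bytes = PARAMS * BYTES_PER_PARAM
full_s = transfer_seconds(full_bytes, LINK_GBPS)
delta_s = transfer_seconds(full_bytes / REDUCTION, LINK_GBPS)

print(f"full-weight broadcast: {full_s:.1f} s")
print(f"sparse delta:          {delta_s:.2f} s")
```

Under these assumptions the full-weight transfer takes on the order of two minutes, while the delta takes on the order of seconds, which is short enough to overlap with rollout generation.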

Abstract

LLM post-training with reinforcement learning (RL) requires frequent synchronization of large model parameters between the trainer and distributed rollout actors. High-throughput RL post-training therefore relies on dedicated RDMA HPC clusters, an infrastructure cost most organizations cannot absorb. A natural alternative is to aggregate loosely coupled GPUs over standard Ethernet and WAN links, but this commodity connectivity cannot sustain full-weight broadcasts: synchronizing an 8B model can take over 100 seconds on bandwidth-limited links, while rollout generation typically takes tens of seconds. Toward making RL practical in this regime, we observe that RL fine-tuning yields highly sparse per-step updates, with only around $1\%$ of parameter elements changing. Atop this insight, we present SparrowRL, a novel high-performance RL training system that preserves bit-exact updates without dropping or quantizing information, designed for commodity-networked, loosely coupled GPU resources. SparrowRL represents each step as a sparse delta checkpoint, pipelines delta extraction with multi-stream transmission, overlaps transfer with rollout generation, and coordinates heterogeneous workers with throughput- and bandwidth-aware scheduling plus lease-based fault tolerance. On Qwen3 models from 4B to 14B deployed across up to four geographic regions, SparrowRL reduces per-step transfer payload by $79\times$ for Qwen3-8B and improves throughput by $2.4$–$9.5\times$ over full-weight broadcast across WAN, narrowing the throughput gap relative to an ideal RDMA single-datacenter baseline to within $8.91\%$. By leveraging on-demand, cross-cloud GPUs over commodity links, SparrowRL delivers $1.21$–$1.59\times$ higher tokens per dollar than reserved RDMA clusters at comparable throughput.
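
The idea of a lossless sparse delta can be illustrated with a minimal NumPy sketch. This is not SparrowRL's actual checkpoint format, which the abstract does not specify; the function names `extract_delta` and `apply_delta`, the `float32` dtype, and the choice to ship the new values at changed positions (rather than arithmetic differences) are all assumptions made for illustration. Comparing raw bit patterns and overwriting values keeps the receiver's copy bit-exact.

```python
import numpy as np


def extract_delta(old: np.ndarray, new: np.ndarray):
    """Return (indices, values) for elements whose bit pattern changed.

    Hypothetical helper: assumes C-contiguous float32 arrays of equal shape.
    """
    old_bits = old.reshape(-1).view(np.uint32)
    new_bits = new.reshape(-1).view(np.uint32)
    idx = np.flatnonzero(old_bits != new_bits)
    # Ship the new values themselves, so applying the delta is bit-exact.
    return idx, new.reshape(-1)[idx].copy()


def apply_delta(weights: np.ndarray, idx: np.ndarray, values: np.ndarray) -> None:
    """Overwrite the changed elements of a C-contiguous array in place."""
    assert weights.flags.c_contiguous
    flat = weights.reshape(-1)  # a view, because the array is contiguous
    flat[idx] = values


# Toy "training step" that touches 1% of the elements.
rng = np.random.default_rng(0)
old = rng.standard_normal(100_000).astype(np.float32)
new = old.copy()
changed = rng.choice(old.size, size=old.size // 100, replace=False)
new[changed] += np.float32(1e-3)

idx, vals = extract_delta(old, new)

# A rollout actor holding the old weights reconstructs the new ones exactly.
replica = old.copy()
apply_delta(replica, idx, vals)
bit_exact = np.array_equal(replica.view(np.uint32), new.view(np.uint32))

# Payload comparison for this toy encoding (index array + value array).
ratio = new.nbytes / (idx.nbytes + vals.nbytes)
print(f"changed={idx.size}, bit_exact={bit_exact}, payload ratio={ratio:.1f}x")
```

The payload ratio of this toy encoding depends on the index and value widths chosen here and is not meant to reproduce the reduction reported in the abstract.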

Paper Structure (23 sections, 1 equation, 13 figures, 7 tables, 1 algorithm)


Introduction
Background and Motivation
Reinforcement Learning for LLMs
The HPC-Centric Status Quo
Loosely Coupled GPU Resources: Opportunity and Challenges
Key Insight and Design Principles
Overview of SparrowRL
System Design
Lossless Sparse Delta Geo-Checkpoints
Streaming Delta Transfer Protocol
Heterogeneity-Aware Scheduling
Fault Tolerance
Discussion
Experiment
Experimental Setup
...and 8 more sections

Figures (13)

Figure 1: RL training architecture for LLMs. The Trainer holds the policy and auxiliary models; Rollout Actors generate rollouts from prompts. Updated weights are transferred every iteration.
Figure 2: Two paradigms for RL training. Left: tightly coupled clusters with RDMA interconnects (100–800 Gbps) and high cost. Right: loosely coupled resources across labs and clouds, connected via cross-cloud links (1–10 Gbps).
Figure 3: Fraction of nonzero parameter updates after one RL step across different models.
Figure 4: Analysis of training dynamics. (a) Visualization of weight update distribution, demonstrating the sparse nature of parameter updates. (b) Weight update sparsity and (c) training reward throughout RL training. We train 4B/8B models on the GSM8K [cobbe2021gsm8k] and DeepScaleR [deepscaler2025] datasets for 800 rollout steps with a fixed learning rate of $1\times10^{-6}$.
Figure 5: SparrowRL architecture overview. The Trainer Hub posts jobs and sparse deltas; Rollout Actors claim prompts, generate samples, and return results. Each region contains a Relay that receives deltas from the Trainer and forwards them to peer Rollout Actors.
...and 8 more figures
