CoPRIS: Efficient and Stable Reinforcement Learning via Concurrency-Controlled Partial Rollout with Importance Sampling

Zekai Qu, Yinxu Pan, Ao Sun, Chaojun Xiao, Xu Han

TL;DR

CoPRIS addresses the inefficiency of fully synchronous RL post-training by enforcing a fixed concurrency level, terminating rollouts early once enough samples are collected, and reusing unfinished trajectories. It couples Concurrency-Controlled Generation with Cross-stage Importance Sampling Correction, which corrects the off-policy bias of reused trajectories using concatenated log-probabilities and thereby keeps updates stable. Empirical results show $1.58\times$–$1.94\times$ speedups across multiple models and context lengths while maintaining or improving performance on challenging math benchmarks, and the speedup grows with longer contexts and larger models. The approach provides a practical, hardware-aware pathway to faster RL training for large language models, with concurrency control and IS correction as the key enablers of scalability and stability.

Abstract

Reinforcement learning (RL) post-training has become a trending paradigm for enhancing the capabilities of large language models (LLMs). Most existing RL systems for LLMs operate in a fully synchronous manner, where training must wait for the rollout of an entire batch to complete. This design leads to severe inefficiencies, as extremely long trajectories can stall the entire rollout process and leave many GPUs idle. To address this issue, we propose Concurrency-Controlled Partial Rollout with Importance Sampling (CoPRIS), which mitigates long-tail inefficiencies by maintaining a fixed number of concurrent rollouts, early-terminating once sufficient samples are collected, and reusing unfinished trajectories in subsequent rollouts. To mitigate the impact of off-policy trajectories, we introduce Cross-stage Importance Sampling Correction, which concatenates buffered log probabilities from the previous policy with those recomputed under the current policy for importance sampling correction. Experiments on challenging mathematical reasoning benchmarks show that CoPRIS achieves up to 1.94x faster training while maintaining comparable or superior performance to synchronous RL systems. The code of CoPRIS is available at https://github.com/777pomingzi/CoPRIS.
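
The concatenation step described in the abstract can be illustrated with a minimal sketch. It is not taken from the CoPRIS repository: the function name, its arguments, and the choice of a plain per-token ratio (without the clipping a PPO-style objective would normally add) are assumptions made for illustration. The prefix of a resumed trajectory keeps the log probabilities buffered when it was generated under the previous policy, the continuation uses log probabilities from the policy that resumed it, and the two are concatenated to form the behavior log probabilities in the importance ratio.

```python
import math


def cross_stage_is_ratios(buffered_prefix_logprobs,
                          continuation_logprobs,
                          training_logprobs):
    """Per-token importance ratios for a trajectory generated in two stages.

    All names here are hypothetical and chosen for illustration only.

    buffered_prefix_logprobs: log probs of the prefix tokens, stored in the
        buffer when an earlier policy generated them.
    continuation_logprobs: log probs of the newly generated tokens under the
        policy that resumed the trajectory.
    training_logprobs: log probs of every token under the policy being
        updated (assumed here to be the numerator of the ratio).
    """
    # Concatenate the two segments so each token is paired with the log prob
    # of the policy that actually sampled it.
    behavior_logprobs = list(buffered_prefix_logprobs) + list(continuation_logprobs)
    if len(behavior_logprobs) != len(training_logprobs):
        raise ValueError("token counts of behavior and training log probs differ")
    # ratio_t = pi_train(token_t) / pi_behavior(token_t), computed in log space.
    return [math.exp(t - b) for t, b in zip(training_logprobs, behavior_logprobs)]


# Two prefix tokens from the buffer, one token generated after resuming.
ratios = cross_stage_is_ratios([-1.0, -2.0], [-0.5], [-1.0, -1.5, -0.5])
```

In this toy input only the second token has a ratio different from 1, because it is the only position where the training log probability differs from the behavior log probability.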

Paper Structure

This paper contains 19 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: RL training traces for a single training step of DeepSeek-R1-Distill-Qwen-7B. Experiments are conducted on DeepScaleR using 32 NVIDIA H800 GPUs with a maximum context length of 16k tokens. The wall-clock GPU utilization trajectories of 8 GPUs are visualized.
  • Figure 2: Illustration of rollout management in CoPRIS. The number of concurrent generations remains constant and is independent of the training batch size. The buffer stores incomplete trajectories together with their log probabilities under the policy that generated them, enabling subsequent continuation and importance sampling correction.
  • Figure 3: Scalability of CoPRIS. Throughput and speedup comparison with veRL under different context lengths and model sizes.
  • Figure 4: Ablation results of the Cross-stage Importance Sampling Correction across two model scales. The top row corresponds to Distill-Qwen-1.5B and the bottom row to Distill-Qwen-7B, showing model performance on AIME24 (left) and AIME25 (right).
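
The rollout management summarized in the Figure 2 caption can be sketched as a toy simulation. This is a simplified illustration under stated assumptions, not the CoPRIS implementation: `run_stage`, the `[traj_id, target_len, tokens_done]` record layout, and the one-token-per-step decoding model are all hypothetical. The sketch keeps a fixed number of generations in flight, stops the stage once enough trajectories have finished, and returns the unfinished ones so they can be resumed first in the next stage.

```python
from collections import deque


def run_stage(queue, concurrency, batch_size):
    """Simulate one rollout stage with a fixed number of in-flight generations.

    queue: deque of [traj_id, target_len, tokens_done] records (hypothetical
        layout); partial trajectories from the buffer should be placed first.
    Returns (finished_ids, unfinished), where `unfinished` holds the records
    that were still generating when the stage terminated early.
    """
    active, finished = [], []
    while len(finished) < batch_size:
        # Refill free slots so concurrency stays constant while work remains.
        while len(active) < concurrency and queue:
            active.append(queue.popleft())
        if not active:
            break  # nothing left to generate
        # One decoding step: every active trajectory emits one token.
        still_running = []
        for traj in active:
            traj[2] += 1
            if traj[2] >= traj[1]:
                finished.append(traj[0])
            else:
                still_running.append(traj)
        active = still_running
    # Several trajectories can finish in the same step, so `finished` may
    # slightly exceed batch_size in this toy model.
    return finished, active


queue = deque([["a", 2, 0], ["b", 5, 0], ["c", 1, 0], ["d", 3, 0]])
finished_1, buffer_1 = run_stage(queue, concurrency=2, batch_size=2)
# Resume buffered partial trajectories before starting fresh prompts.
queue = deque(buffer_1 + list(queue))
finished_2, buffer_2 = run_stage(queue, concurrency=2, batch_size=2)
```

In the first stage the long trajectory `b` is cut off after three tokens and buffered rather than stalling the batch; in the second stage it is resumed from that point, which is where the buffered log probabilities would be needed for importance sampling correction.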