Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization

Haodong Zhu, Yangyang Ren, Yanjing Li, Mingbao Lin, Linlin Yang, Xuhui Liu, Xiantong Zhen, Haiguang Liu, Baochang Zhang

TL;DR

The paper proposes Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation through importance sampling-based correction, together with Dense Prompt Packing, a window-based greedy strategy that maximizes valid token density and hardware utilization.

Abstract

Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs due to its extensive group-based sampling requirement. While recent selective data utilization methods can mitigate this overhead, they could induce estimation bias by altering the underlying sampling distribution, compromising theoretical rigor and convergence behavior. To address this limitation, we propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation through importance sampling-based correction. By incorporating mathematically derived rescaling factors, DPPO significantly accelerates GRPO training without altering the optimization objective of the full-batch baseline. Furthermore, to mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy that maximizes valid token density and hardware utilization. Extensive experiments demonstrate that DPPO consistently accelerates training across diverse models and benchmarks. For instance, on Qwen3-4B trained on MATH, DPPO achieves 2.37$\times$ training speedup and outperforms GRPO by 3.36% in average accuracy across six mathematical reasoning benchmarks.
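
The idea of keeping a pruned estimator unbiased can be illustrated with generic inverse-probability rescaling. The sketch below is not DPPO's actual rule (the abstract does not state its rescaling factors); it only shows the standard identity that if sample $i$ is kept with probability $p_i > 0$ and its contribution is scaled by $1/p_i$, the expected pruned sum equals the full sum. The values, keep probabilities, and function names are hypothetical.

```python
import random

def pruned_estimate(values, keep_probs, rng):
    """Sum of retained values, each rescaled by 1 / its keep probability."""
    total = 0.0
    for v, p in zip(values, keep_probs):
        if rng.random() < p:   # keep this sample with probability p
            total += v / p     # inverse-probability rescaling
    return total

def expected_pruned_estimate(values, keep_probs):
    """Exact expectation of pruned_estimate: sum_i p_i * (v_i / p_i)."""
    return sum(p * (v / p) for v, p in zip(values, keep_probs))

# Hypothetical per-sample contributions and keep probabilities (all > 0).
values = [0.5, -1.2, 2.0, 0.1]
keep_probs = [0.9, 0.5, 1.0, 0.3]

full_sum = sum(values)
exact_mean = expected_pruned_estimate(values, keep_probs)

# Monte Carlo check with a fixed seed: the average of many pruned
# estimates should be close to the full (unpruned) sum.
rng = random.Random(0)
trials = 100_000
mc_mean = sum(pruned_estimate(values, keep_probs, rng) for _ in range(trials)) / trials

print(f"full sum: {full_sum:.4f}, exact mean: {exact_mean:.4f}, MC mean: {mc_mean:.4f}")
```

Rescaling removes the bias but not the variance: samples with small keep probabilities contribute large rescaled terms, which is why a variance bound such as the paper's Theorem 4.1 matters.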

Paper Structure

This paper contains 23 sections, 1 theorem, 33 equations, 4 figures, 11 tables, 1 algorithm.

Key Result

Theorem 4.1

Under DPPO's hierarchical pruning mechanism, the gradient estimator variance is bounded by: For moderate pruning rates (e.g., $r_q, r_o \leq 0.7$ with $\beta = 0.5$), $\frac{(1-\beta r_q)\left[1 - (1-\beta)r_q\right]}{1-r_q} \leq 1.42$, and the total variance remains well-controlled. For aggressive pruning (e.g., $r_q = r_o = 0.9$), this factor increases to approximately $3.025$, leading t
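
The two numeric values quoted in the theorem can be checked directly from the factor $\frac{(1-\beta r_q)\left[1 - (1-\beta)r_q\right]}{1-r_q}$. The following is a small sketch that assumes only this expression; the function name `variance_factor` is ours, and the rest of the bound is not reproduced here.

```python
def variance_factor(r_q: float, beta: float) -> float:
    """Evaluate (1 - beta*r_q) * (1 - (1 - beta)*r_q) / (1 - r_q)."""
    if not 0.0 <= r_q < 1.0:
        raise ValueError("r_q must lie in [0, 1)")
    return (1 - beta * r_q) * (1 - (1 - beta) * r_q) / (1 - r_q)

beta = 0.5
moderate = variance_factor(0.7, beta)    # about 1.408, below the quoted 1.42
aggressive = variance_factor(0.9, beta)  # about 3.025

# For beta = 0.5 the factor grows with r_q, so the value at r_q = 0.7
# is the largest one on a grid over [0, 0.7].
grid_max = max(variance_factor(i / 100, beta) for i in range(0, 71))

print(f"r_q=0.7: {moderate:.4f}, r_q=0.9: {aggressive:.4f}, grid max: {grid_max:.4f}")
```

With $\beta = 0.5$ the expression simplifies to $(1 - 0.5\,r_q)^2 / (1 - r_q)$, which is $0.4225 / 0.3 \approx 1.408$ at $r_q = 0.7$ and $0.3025 / 0.1 = 3.025$ at $r_q = 0.9$.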

Figures (4)

  • Figure 1: Comparison of accuracy and training time on the MATH dataset for Qwen3-4B and Qwen3-8B. Bars indicate accuracy and red lines indicate training time. Our method achieves the highest accuracy while requiring the least training time.
  • Figure 2: Overview of DPPO. It employs a hierarchical pruning strategy to accelerate GRPO by reducing redundancy at both the prompt level (via difficulty estimation $H_t(q)$) and the completion level (via advantage assessment $|A_t|$). In the left panel, a mathematically grounded rescaling mechanism is applied to retained samples to correct for estimation bias. The right panel details the end-to-end training loop, where dynamic pruning and importance-based rescaling are integrated to ensure efficient yet unbiased policy optimization.
  • Figure 3: Dense Prompt Packing Strategy. The left panel shows the window-based greedy algorithm for assembling variable-length prompts into compact sequences. The right panel indicates the mitigation of pruning-induced sparsity. Unlike standard batching, our approach maximizes valid token density and hardware saturation, ensuring throughput remains consistent with the full-batch pattern.
  • Figure 4: Training dynamics of GRPO and DPPO variants on the MATH dataset with Qwen3-4B (left) and Qwen3-8B (right).
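
The Figure 3 caption describes Dense Prompt Packing only as a window-based greedy algorithm that assembles variable-length prompts into compact sequences. One possible reading is sketched below; it should not be taken as the paper's algorithm. The function names, the `capacity` and `window` parameters, and the first-fit rule inside the window are all assumptions made for illustration.

```python
from typing import List

def pack_prompts(lengths: List[int], capacity: int, window: int) -> List[List[int]]:
    """Greedily pack prompt indices into sequences of at most `capacity` tokens.

    Assumed rule: for each new sequence, only the first `window` pending
    prompts are examined, and every one that still fits is added in order.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if any(n > capacity for n in lengths):
        raise ValueError("a prompt is longer than the sequence capacity")
    pending = list(range(len(lengths)))
    packed: List[List[int]] = []
    while pending:
        used = 0
        chosen: List[int] = []
        for idx in pending[:window]:           # look only inside the window
            if used + lengths[idx] <= capacity:
                chosen.append(idx)             # first prompt always fits
                used += lengths[idx]
        packed.append(chosen)
        taken = set(chosen)
        pending = [i for i in pending if i not in taken]
    return packed

def token_density(lengths: List[int], packed: List[List[int]], capacity: int) -> float:
    """Fraction of allocated token slots that hold real (non-padding) tokens."""
    if not packed:
        return 0.0
    return sum(lengths) / (len(packed) * capacity)

# Hypothetical prompt lengths in tokens.
lengths = [700, 300, 500, 200, 400, 100]
packed = pack_prompts(lengths, capacity=1024, window=4)
unpacked = [[i] for i in range(len(lengths))]  # one prompt per sequence

print(packed)
print(token_density(lengths, packed, 1024), token_density(lengths, unpacked, 1024))
```

In this toy case six prompts fit into three sequences instead of six, so the share of slots holding real tokens roughly doubles. A larger window gives the greedy step more candidates to fill gaps, at the cost of more reordering.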

Theorems & Definitions (1)

  • Theorem 4.1: Total Variance Bound