Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement

Yongsheng Lian

TL;DR

This paper compares three RL-based fine-tuning strategies (PPO, GRPO, and DAPO) for improving complex reasoning in LLMs, using a controlled transfer setup in which models are first trained on the Countdown Game and then evaluated on a suite of reasoning benchmarks. It provides a parametric study of hyperparameters (entropy bonus, learning rate), structural choices (group size, loss granularity), and techniques (dynamic sampling) to understand their impact on training stability and performance. The key findings are that larger group sizes generally stabilize learning and improve accuracy, that token-level loss (as in DAPO) helps mitigate reward hacking and supports longer reasoning chains, and that the dynamic sampling strategy in DAPO does not consistently improve task performance. Across GSM8K, MATH, BBH, and MMLU-Pro, RL fine-tuning yields gains over the base model, with DAPO without dynamic sampling (DS) often delivering the strongest improvements. Together, these results offer practical guidance for RL-based LLM training in reasoning-heavy settings.
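To make the "loss granularity" distinction concrete, the following is a minimal sketch of the two common ways to average per-token losses over a batch of responses. The function names and the toy numbers are illustrative assumptions, not code from the paper; the paper's exact loss definitions are not reproduced here.

```python
def sample_level_mean(per_token_losses):
    """Average within each response first, then across responses.

    Every response carries equal weight regardless of its length.
    """
    per_sample = [sum(tokens) / len(tokens) for tokens in per_token_losses]
    return sum(per_sample) / len(per_sample)


def token_level_mean(per_token_losses):
    """Average over all tokens in the batch at once.

    Every token carries equal weight, so longer responses contribute more.
    """
    total = sum(sum(tokens) for tokens in per_token_losses)
    count = sum(len(tokens) for tokens in per_token_losses)
    return total / count


# Hypothetical batch: one 1-token response and one 3-token response.
batch = [[1.0], [0.0, 0.0, 0.0]]
print(sample_level_mean(batch))  # 0.5
print(token_level_mean(batch))   # 0.25
```

In this toy batch the short response dominates the sample-level average, while the token-level average weights it by its single token. This is only meant to show how the two aggregation choices differ numerically.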

Abstract

This study presents a systematic comparison of three Reinforcement Learning (RL) algorithms (PPO, GRPO, and DAPO) for improving complex reasoning in large language models (LLMs). Our main contribution is a controlled transfer-learning evaluation: models are first fine-tuned on the specialized Countdown Game and then assessed on a suite of general-purpose reasoning benchmarks. Across all tasks, RL-trained models outperform their corresponding base models, although the degree of improvement differs by benchmark. Our parametric analysis offers practical guidance for RL-based LLM training. Increasing the group size in GRPO and DAPO leads to more stable training dynamics and higher accuracy, while the impact of the KL-penalty coefficient is non-monotonic. Additionally, we find that the Dynamic Sampling (DS) component in DAPO does not improve performance; in fact, the best overall results are achieved with DAPO when DS is disabled.
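As a rough illustration of why group size matters and what Dynamic Sampling filters out, the sketch below computes group-normalized advantages for one prompt and applies a DS-style filter that discards groups whose rewards are all identical. It assumes binary rewards, population standard deviation, and a small stabilizing constant; these choices and the function names are assumptions for illustration rather than the paper's implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: population std and an additive eps; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards):
    """DS-style filter: keep a group only if its rewards are not all equal.

    A group with identical rewards yields all-zero advantages and therefore
    contributes no learning signal under group normalization.
    """
    return len(set(rewards)) > 1


# Hypothetical groups of binary rewards (1 = correct, 0 = incorrect).
mixed = [1.0, 0.0, 0.0, 1.0]
uniform = [1.0, 1.0, 1.0, 1.0]

print(group_advantages(mixed))    # roughly [1, -1, -1, 1]
print(group_advantages(uniform))  # all zeros
print(keep_group(mixed), keep_group(uniform))  # True False
```

With a larger group, the mean and standard deviation are estimated from more samples, which is one plausible reason for the more stable training reported above.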

Paper Structure

This paper contains 35 sections, 32 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Model performance on different benchmarks. DAPO with no dynamic sampling.
  • Figure 2: Clip function illustrated
  • Figure 3: The effect of the entropy bonus on PPO training. (a): The entropy bonus increases the fraction of clipped tokens during training. (b): The entropy bonus increases the KL divergence between the new and old policies, favoring exploration. (c): Model accuracy. Surprisingly, adding an entropy bonus leads to lower accuracy than training without entropy regularization.
  • Figure 4: Model performance on GSM8K with different learning rates. A smaller learning rate of $1 \times 10^{-6}$ leads to more stable training, while a larger learning rate produces larger fluctuations but reaches higher accuracy.
  • Figure 5: GRPO performance on Countdown with different group sizes. A larger group size $G$ leads to better model accuracy and lower KL divergence, at a higher computational cost.
  • ...and 3 more figures
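
For readers unfamiliar with the clip function named in Figure 2, the following is a minimal per-token sketch of the standard PPO clipped surrogate. The clip range `eps=0.2` is a commonly used default and an assumption here, not a value taken from the paper.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for a single token.

    ratio: new-policy probability divided by old-policy probability.
    advantage: estimated advantage for this token.
    eps: clip range (assumed value, for illustration only).
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum removes the incentive to push the ratio
    # beyond the clip range in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)


print(clipped_surrogate(1.5, 1.0))   # capped at the upper clip bound
print(clipped_surrogate(0.5, -1.0))  # capped at the lower clip bound
print(clipped_surrogate(1.5, -1.0))  # not capped: the penalty is kept in full
```

A token counts as "clipped" when the clipped term is the one selected, which is the quantity tracked as the clipped-token fraction in Figure 3(a).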