Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement

Yongsheng Lian

TL;DR

This paper methodically compares three RL-based fine-tuning strategies (PPO, GRPO, and DAPO) for improving complex reasoning in LLMs, using a controlled transfer setup in which models are first trained on the Countdown Game and then evaluated on a suite of reasoning benchmarks. It provides a detailed parametric study of hyperparameters (entropy, learning rate), structural choices (group size, loss granularity), and techniques (dynamic sampling) to understand their impact on stability and performance. The key findings are that larger group sizes generally stabilize learning and improve accuracy, that token-level loss (as in DAPO) helps mitigate reward hacking and supports longer reasoning chains, and that the dynamic sampling (DS) strategy in DAPO does not consistently improve task performance. Across GSM8K, MATH, BBH, and MMLU-Pro, RL fine-tuning yields gains over the base model, with DAPO without DS often delivering the strongest improvements. These results offer practical guidance for RL-based LLM training in reasoning-heavy settings.

Abstract

This study presents a systematic comparison of three Reinforcement Learning (RL) algorithms (PPO, GRPO, and DAPO) for improving complex reasoning in large language models (LLMs). Our main contribution is a controlled transfer-learning evaluation: models are first fine-tuned on the specialized Countdown Game and then assessed on a suite of general-purpose reasoning benchmarks. Across all tasks, RL-trained models outperform their corresponding base models, although the degree of improvement differs by benchmark. Our parametric analysis offers practical guidance for RL-based LLM training. Increasing the group size in GRPO and DAPO leads to more stable training dynamics and higher accuracy, while the impact of the KL-penalty coefficient is non-monotonic. Additionally, we find that the Dynamic Sampling (DS) component in DAPO does not improve performance; in fact, the best overall results are achieved with DAPO when DS is disabled.
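The structural choices discussed above (group size, loss granularity, and dynamic sampling) can be illustrated with a minimal Python sketch. It follows how GRPO and DAPO are commonly described, not the paper's actual training code: the function names, the `eps` constant, and the toy inputs are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward against its own group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` is an assumed small constant that avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards):
    """Dynamic-sampling-style filter (assumed form).

    A group whose responses all receive the same reward yields zero
    advantages, so it is dropped; only groups with mixed outcomes are kept.
    """
    return len(set(rewards)) > 1


def sample_level_loss(token_losses):
    """Average within each response first, then across responses."""
    return mean(mean(seq) for seq in token_losses)


def token_level_loss(token_losses):
    """Average over every token in the group at once.

    Longer responses contribute more tokens and therefore more weight.
    """
    total = sum(sum(seq) for seq in token_losses)
    count = sum(len(seq) for seq in token_losses)
    return total / count


if __name__ == "__main__":
    # Toy group of four responses with binary correctness rewards.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    print(keep_group([1.0, 1.0, 1.0, 1.0]), keep_group([1.0, 0.0]))

    # One short response and one long response with per-token losses.
    losses = [[1.0, 1.0], [0.0] * 6]
    print(sample_level_loss(losses), token_level_loss(losses))
```

In this sketch a larger group gives the mean and standard deviation more samples to estimate from, which is one plausible reading of why larger groups stabilize training. The two loss functions differ only in how response length is weighted.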
