Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning

Chen Li, Nazhou Liu, Kai Yang

TL;DR

This work addresses instability and token inefficiency in Group Relative Policy Optimization (GRPO) used for RLHF-based reasoning in large language models. It introduces Adaptive Group Policy Optimization (AGPO), featuring an adaptive loss with a loss mask and loss clip to prevent zero-advantage signals and entropy collapse, thereby stabilizing training and reducing reasoning tokens. Empirical results on Qwen2.5-7B/14B across math and code tasks show AGPO delivers higher Pass@1 and substantially fewer tokens compared with GRPO, with ablations confirming the necessity of both components. The approach demonstrates improved generalization to code reasoning benchmarks (LiveCodeBench) and offers a practical path toward more token-efficient, stable RLHF for reasoning LLMs.

Abstract

Since DeepSeek-R1 popularized it, Group Relative Policy Optimization (GRPO) has become a core part of training reasoning LLMs. However, we find deficiencies that affect RL stability and inference efficiency, such as zero variance in advantage estimation. We therefore propose Adaptive Group Policy Optimization (AGPO), which uses a simple but effective method, an adaptive loss function, to mitigate training fluctuation and token inefficiency. Experiments demonstrate that our method achieves more stable training and superior performance with significantly fewer tokens in reasoning steps.
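
The zero-variance issue mentioned in the abstract can be illustrated with a minimal sketch of the standard group-relative advantage used by GRPO (each reward is normalized by the mean and standard deviation of its sampling group). This is not the AGPO loss itself, whose details are not given on this page. The function names, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Assumptions (not from the paper): population standard deviation
    and a small `eps` added to the denominator to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_zero_variance(rewards):
    """True when every response in the group received the same reward."""
    return pstdev(rewards) == 0.0


# A group where some responses are correct and some are not:
# the advantages carry a learning signal.
mixed = [1.0, 0.0, 0.0, 1.0]
mixed_adv = group_relative_advantages(mixed)

# A group where every response is correct: all advantages are zero,
# so this group contributes no gradient signal to the policy update.
uniform = [1.0, 1.0, 1.0, 1.0]
uniform_adv = group_relative_advantages(uniform)

print(mixed_adv)
print(uniform_adv, has_zero_variance(uniform))
```

When all rewards in a group are identical (all correct or all wrong), every advantage is zero, which is the situation the TL;DR describes as a zero-advantage signal.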

Paper Structure

This paper contains 14 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Actor entropy curves of GRPO and AGPO for Qwen2.5-7B
  • Figure 2: Actor entropy curves of GRPO and AGPO for Qwen2.5-14B
  • Figure 3: Response length curves of GRPO and AGPO for Qwen2.5-7B
  • Figure 4: Response length curves of GRPO and AGPO for Qwen2.5-14B
  • Figure 5: Reward score curves of GRPO and AGPO for Qwen2.5-7B
  • ...and 1 more figures