Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization

Xinchen Han; Hossam Afifi; Michel Marot; Xilu Wang; Lu Yin

TL;DR

This paper tackles the inefficiency of verbose Chain-of-Thought in large language models by introducing Fine-grained Group Policy Optimization (FGO), an RL-based method that refines and compresses group responses through fine-grained rewards based on length and entropy. By extending Group Relative Policy Optimization (GRPO) with subgrouping into correct and incorrect responses and applying length-entropy weighted shaping, FGO achieves substantial CoT compression while preserving or improving reasoning accuracy across math benchmarks such as MATH500, AIME24, AMC23, and Minerva. Empirically, FGO delivers 100% data utilization, mitigates entropy collapse, and demonstrates robustness across multiple models; ablation studies pinpoint the critical role of the length-entropy balance (with $\alpha=0.01$ often yielding the best results). The work offers a practical approach to efficient reasoning with LLMs, enabling faster inference and lower costs without sacrificing performance, and suggests future directions for more accurate advantage estimation with fewer group samples.

Abstract

Large Language Models (LLMs) often generate unnecessarily verbose Chain-of-Thought (CoT) reasoning that increases computational cost and latency without proportional performance gains. In this paper, we propose Fine-grained Group policy Optimization (FGO), a Reinforcement Learning (RL) algorithm that refines group responses by subdividing them and assigning appropriate weights based on length and entropy, thereby enabling effective CoT compression. As an enhanced variant of Group Relative Policy Optimization (GRPO), FGO also addresses two major limitations of GRPO: inefficient data utilization and entropy collapse. We evaluate FGO on multiple reasoning LLMs and benchmarks, including MATH500, AIME24, AMC23, and Minerva. Experimental results show that FGO achieves efficient CoT compression without degrading performance while resolving these two limitations of GRPO.
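To make the data-utilization limitation concrete, the following is a minimal sketch of the standard GRPO group-relative advantage, not of FGO itself, since FGO's length-entropy weighting is not specified in this summary. The function name `grpo_advantages`, the binary correctness rewards, the `eps` constant, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standardize each reward within its group (group-relative advantage).

    Assumption: population standard deviation; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with both correct (1.0) and incorrect (0.0) responses
# yields non-zero advantages of opposite sign.
mixed = grpo_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every response is correct yields all-zero advantages,
# so the group contributes no learning signal under plain GRPO.
all_correct = grpo_advantages([1.0, 1.0, 1.0, 1.0])
```

Under these assumptions, any group whose responses all receive the same reward is effectively discarded. This is one plausible reading of the "inefficient data utilization" that the abstract says FGO addresses through subgrouping and length-entropy weights.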

Paper Structure (10 sections, 9 equations, 3 figures, 3 tables)

Introduction
Preliminaries
Methodology
Experiments
Experiment Settings
Main Results
Self-Reflection Results
Results on Eliminating the Two Limitations of GRPO
Ablation Experiments
Conclusion

Figures (3)

Figure 1: A case study with ZR1-1.5B on the MATH500 dataset, comparing the Vanilla, GRPO, and FGO methods.
Figure 2: Counts of self-reflection keywords.
Figure 3: GRPO and FGO training curves on the Qwen2.5-Math-1.5B model, showing reward, length, and entropy.
