Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

Vaishnavi Shrivastava, Ahmed Awadallah, Vidhisha Balachandran, Shivam Garg, Harkirat Behl, Dimitris Papailiopoulos

TL;DR

The paper targets length inflation in reinforcement learning with verifiable rewards (RLVR) used for complex reasoning. It introduces Group Filtered Policy Optimization (GFPO), which samples a larger group of candidate responses per prompt and trains only on the top-k candidates ranked by a target metric such as length or token efficiency (reward per token), shaping the policy toward concise reasoning. An Adaptive Difficulty variant adjusts the size of the retained set based on real-time estimates of problem difficulty. Empirical results on Phi-4-reasoning across AIME 24/25, GPQA, Omni-MATH, and LiveCodeBench show that GFPO reduces GRPO's length inflation by up to ~85% while maintaining accuracy, revealing a favorable training-time vs inference-time compute trade-off and improved out-of-distribution robustness.

Abstract

Large language models trained with reinforcement learning with verifiable rewards tend to trade accuracy for length--inflating response lengths to achieve gains in accuracy. While longer answers may be warranted for harder problems, many tokens are merely "filler": repetitive, verbose text that makes no real progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem during training and filtering responses to train on based on two key metrics: (1) response length and (2) token efficiency: reward per token ratio. By sampling more at training time, we teach models to think less at inference time. On the Phi-4-reasoning model, GFPO cuts GRPO's length inflation by 46-71% across challenging STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while maintaining accuracy. Optimizing for reward per token further increases reductions in length inflation to 71-85%. We also propose Adaptive Difficulty GFPO, which dynamically allocates more training resources to harder problems based on real-time difficulty estimates, improving the balance between computational efficiency and accuracy especially on difficult questions. GFPO demonstrates that increased training-time compute directly translates to reduced test-time compute--a simple yet effective trade-off for efficient reasoning.
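
The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `gfpo_advantages`, its parameters, and the choice to normalize rewards using the mean and standard deviation of the retained subset are assumptions made for the example. The source states only that a larger group is sampled, responses are ranked by length or reward per token, and the responses outside the top-k do not contribute to training.

```python
from statistics import mean, pstdev


def gfpo_advantages(rewards, lengths, k, metric="length", eps=1e-6):
    """Hypothetical sketch: one advantage per sampled response.

    Responses outside the retained top-k get an advantage of 0.0.
    Normalizing over the retained subset is an assumption of this sketch.
    """
    if len(rewards) != len(lengths):
        raise ValueError("rewards and lengths must have the same size")
    if not 1 <= k <= len(rewards):
        raise ValueError("k must be between 1 and the group size")
    if any(n <= 0 for n in lengths):
        raise ValueError("lengths must be positive")

    if metric == "length":
        # Shorter responses rank first.
        scores = [float(n) for n in lengths]
    elif metric == "token_efficiency":
        # Higher reward per token ranks first, so negate for ascending sort.
        scores = [-(r / n) for r, n in zip(rewards, lengths)]
    else:
        raise ValueError("unknown metric: " + metric)

    # Indices of the k best responses under the chosen metric.
    kept = sorted(range(len(rewards)), key=lambda i: scores[i])[:k]

    kept_rewards = [rewards[i] for i in kept]
    mu = mean(kept_rewards)
    sigma = pstdev(kept_rewards)

    advantages = [0.0] * len(rewards)
    for i in kept:
        advantages[i] = (rewards[i] - mu) / (sigma + eps)
    return advantages


# Group of G = 4 responses, keep the k = 2 shortest.
adv = gfpo_advantages([1.0, 0.0, 1.0, 0.0], [100, 200, 300, 400], k=2)
```

In this toy group, the two longest responses receive zero advantage even though one of them is correct, so only the short responses influence the policy update. That is the sense in which filtering acts as implicit reward shaping.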

Paper Structure

This paper contains 12 sections, 7 equations, 11 figures, 3 tables, 2 algorithms.

Figures (11)

  • Figure 1: Left: GFPO introduces simple yet powerful modifications to GRPO: sample more responses during training ($\uparrow G$), rank them by a target attribute (e.g., length, token efficiency), and learn only from the top-$k$—setting the advantages of the rest to zero. This selective learning functions as implicit reward shaping, steering the policy toward desired behaviors. Right: When optimizing for length or token efficiency, GFPO curbs GRPO’s length inflation—letting the model think less at inference-time by sampling more at training-time—while maintaining its core reasoning capabilities.
  • Figure 3: Pareto Trade-off Between Accuracy and Response Length. For all benchmarks except AIME 25, at least one GFPO variant strictly dominates GRPO—achieving both higher accuracy and shorter responses (green region above and to the left of GRPO). For AIME 25, GRPO attains the highest accuracy, but several GFPO variants, while taking non-significant accuracy dips, remain Pareto-optimal because their responses are shorter, and no other method is simultaneously more accurate and more concise. On average, Shortest 4/24, Adaptive Difficulty, and Shortest 8/16 are strictly Pareto-superior to GRPO with Token Efficiency close behind.
  • Figure 4: Average Response Length vs. $k/G$. Lowering $k/G$ reduces the average response length $L_{avg}$, but beyond a point the reductions show diminishing returns.
  • Figure 5: Average Length and Excess Length Reductions Across Problem Difficulties. Response lengths rise with problem difficulty for all methods (left), but GFPO reduces excess length across all problem difficulty levels (right). Token Efficiency yields the largest reductions, with responses shorter than even the SFT baseline on easy questions. Shortest 8/24 has the strongest reductions on very hard questions.
  • Figure 6: GFPO Accuracy Across Difficulty Levels and Response Lengths. (a) Accuracy Across Problem Difficulties shows Adaptive Difficulty matching or exceeding GRPO accuracy on easy, medium, and very hard questions, with Shortest 8/24 matching GRPO on hard problems via larger $G$. (b) Accuracy, Response Share, and Prompt Difficulty by Response Length shows that GFPO cuts long-tail verbosity (the share of outputs $\geq 20k$ tokens falls from 32% to 22%) and solves hard problems with shorter responses ($\sim$9x harder prompts solved with $\leq 5k$ tokens).
  • ...and 6 more figures
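
The Pareto comparison used in the Figure 3 caption can be made concrete with a small sketch. The function names and the method names and numbers below are placeholders invented for illustration; they are not results from the paper. A method strictly dominates another when it is both more accurate and more concise, and a method is Pareto-optimal when no other method strictly dominates it.

```python
def strictly_dominates(a, b):
    """True if point a beats point b on both axes.

    Each point is an (accuracy, average_length) pair: higher accuracy is
    better and shorter average length is better.
    """
    acc_a, len_a = a
    acc_b, len_b = b
    return acc_a > acc_b and len_a < len_b


def pareto_optimal(methods):
    """Return the sorted names of methods not strictly dominated by any other."""
    return sorted(
        name
        for name, point in methods.items()
        if not any(
            strictly_dominates(other, point)
            for other_name, other in methods.items()
            if other_name != name
        )
    )


# Placeholder values only, chosen to show the three possible cases.
methods = {
    "method_a": (0.70, 12000),  # dominated by method_b
    "method_b": (0.72, 10000),  # more accurate and shorter than method_a
    "method_c": (0.74, 14000),  # most accurate, but also the longest
}
front = pareto_optimal(methods)
```

Under this definition, a method with the highest accuracy (like `method_c`) stays on the front even with long responses, and a shorter but slightly less accurate method can also remain Pareto-optimal, which mirrors the AIME 25 case described in the caption.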