DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization

Gang Li, Yan Chen, Ming Lin, Tianbao Yang

TL;DR

The paper addresses the inefficiency of reasoning in large reasoning models by showing that incorporating length penalties into GRPO can produce negative learning signals for correct but verbose outputs. It introduces Decoupled Reward Policy Optimization (DRPO), which decouples positive and negative learning signals within a discriminative framework (DisCO) and integrates a length-rewarded positive-data distribution under KL regularization. A closed-form solution for the optimized positive distribution enables efficient, on-policy optimization using importance weighting. Empirical results on mathematical reasoning benchmarks demonstrate substantial reductions in generated length with minimal accuracy loss, outperforming six baselines across 1.5B and 7B models and highlighting DRPO’s potential for efficient, scalable reasoning.

Abstract

Recent large reasoning models (LRMs) driven by reinforcement learning algorithms (e.g., GRPO) have achieved remarkable performance on challenging reasoning tasks. However, these models suffer from overthinking, generating unnecessarily long and redundant reasoning even for simple questions, which substantially increases computational cost and response latency. While existing methods incorporate length rewards into GRPO to promote concise reasoning, they incur significant performance degradation. We identify the root cause: when rewards for correct but long rollouts are penalized, GRPO's group-relative advantage function can assign them negative advantages, actively discouraging valid reasoning. To overcome this, we propose Decoupled Reward Policy Optimization (DRPO), a novel framework that decouples the length-based learning signal of correct rollouts from that of incorrect ones. DRPO ensures that reward signals for correct rollouts are normalized solely within the positive group, shielding them from interference by negative samples. DRPO's objective is grounded in integrating an optimized positive data distribution, which maximizes length-based rewards under KL regularization, into a discriminative objective. We derive a closed-form solution for this distribution, enabling efficient computation of the objective and its gradients using only on-policy data and importance weighting. Of independent interest, this formulation is general and can incorporate preference rewards on positive data other than length. Experiments on mathematical reasoning tasks demonstrate DRPO's significant superiority over six efficient reasoning baselines. Notably, with a 1.5B model, our method achieves a 77% length reduction with only a 1.1% performance loss on simple questions such as those in the GSM8K dataset, while the follow-up baseline sacrifices 4.3% for a 68% length reduction.
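The closed-form positive distribution mentioned in the abstract can be illustrated with the standard result for KL-regularized reward maximization: maximizing $\mathbb{E}_{P}[r] - \lambda\,\mathrm{KL}(P \,\|\, \pi^{+})$ over $P$ gives $P^{*}(y) \propto \pi^{+}(y)\exp(r(y)/\lambda)$. The sketch below is a minimal illustration under that assumption, not the paper's exact objective. It turns length rewards of correct on-policy rollouts into self-normalized importance weights. The function name `positive_sample_weights` and the example rewards are hypothetical.

```python
import math


def positive_sample_weights(length_rewards, lam):
    """Self-normalized importance weights exp(r_i / lam) / sum_j exp(r_j / lam).

    Assumes the correct rollouts were sampled on-policy, so reweighting them
    by exp(r / lam) approximates expectations under P*(y) ~ pi+(y) exp(r(y) / lam).
    """
    n = len(length_rewards)
    if math.isinf(lam):
        # lam -> +inf removes the length preference: every correct rollout
        # receives the same weight.
        return [1.0 / n] * n
    scaled = [r / lam for r in length_rewards]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical length rewards for three correct rollouts (shorter = higher).
rewards = [0.73, 0.6, 0.2]
weights_sharp = positive_sample_weights(rewards, lam=0.1)
weights_flat = positive_sample_weights(rewards, lam=math.inf)
print(weights_sharp)
print(weights_flat)
```

Under this weighting every correct rollout keeps a strictly positive weight, so a long correct answer is down-weighted rather than penalized. Smaller values of `lam` concentrate the weight on shorter rollouts.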

Paper Structure

This paper contains 17 sections, 13 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of the limitation of GRPO with a length penalty and the benefit of our approach. Suppose $[1, 1, 1, 0, 0, 0]$ are the accuracy rewards for 6 responses, and $[0.73, 0.6, 0.2, 0, 0, 0]$ are the rewards after applying the length penalty to the correct answers. Under GRPO's group-relative advantage calculation, the advantage of the third response shifts from 1 (without the length penalty) to -0.17 (with the length penalty), inadvertently penalizing a correct response, which may substantially harm performance. In contrast, our proposed DRPO reduces the learning signal for lengthy correct responses but never pushes it into negative territory.
  • Figure 2: Training dynamics of DRPO with different regularization weights $\lambda$. The left two plots are for fine-tuning the 1.5B model, and the right two are for fine-tuning the 7B model. $\lambda=+\infty$ denotes the reference method DisCO, which does not incorporate length rewards in training.
  • Figure 3: Comparison of the performance-efficiency trade-off. Left: fine-tuning the 1.5B model; right: fine-tuning the 7B model. Grey lines represent the base model performance before fine-tuning, with a generation length of 4698 for the 1.5B model and 4119 for the 7B model. Squares denote models trained with reference methods without length penalties (i.e., $\lambda=+\infty$ for DRPO, $\alpha=0$ for RLOO-LP, $\beta=0$ for ALP, $w=0$ for HAPO). Triangles denote models trained by other works.
  • Figure 4: Performance-efficiency trade-off on individual datasets, with difficulty increasing from left to right. (a) Fine-tuning the 1.5B model; (b) fine-tuning the 7B model. Squares denote models trained with reference methods without length penalties (i.e., $\lambda=+\infty$ for DRPO, $\alpha=0$ for RLOO-LP, $\beta=0$ for ALP, $w=0$ for HAPO).
  • Figure 5: Example reasoning for Prompt 1 from DisCO (DRPO with $\lambda = +\infty$) and DRPO ($\lambda = 0.1$). Answers are shown in green and reflection words in blue. DRPO reaches the correct answer with clear reasoning in only 89 tokens, a 6× reduction compared to the 526 tokens required by DisCO.
  • ...and 2 more figures
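
The sign flip described in the Figure 1 caption can be reproduced with a few lines of arithmetic. The sketch below assumes the usual group-relative advantage, (reward minus group mean) divided by group standard deviation. The function name `group_relative_advantages` is hypothetical, and the exact magnitudes depend on whether a population or a sample standard deviation is used, which the caption does not specify.

```python
from statistics import mean, pstdev, stdev


def group_relative_advantages(rewards, sample_std=False):
    """Standardize rewards within one group: (r - mean) / std."""
    mu = mean(rewards)
    # pstdev divides by n (population); stdev divides by n - 1 (sample).
    sigma = stdev(rewards) if sample_std else pstdev(rewards)
    return [(r - mu) / sigma for r in rewards]


accuracy_rewards = [1, 1, 1, 0, 0, 0]            # from the Figure 1 caption
penalized_rewards = [0.73, 0.6, 0.2, 0, 0, 0]    # from the Figure 1 caption

adv_plain = group_relative_advantages(accuracy_rewards)
adv_penalized = group_relative_advantages(penalized_rewards)
adv_penalized_sample = group_relative_advantages(penalized_rewards, sample_std=True)

# Third response (index 2): correct, but the longest of the correct ones.
print(adv_plain[2])             # positive: 1.0 with the population std
print(adv_penalized[2])         # negative: about -0.18 with the population std
print(adv_penalized_sample[2])  # negative: about -0.17 with the sample std
```

Under either convention the third response, although correct, ends up with a negative advantage once the length penalty is applied, because its penalized reward of 0.2 falls below the group mean of 0.255.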