Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning

Yinan Xia, Haotian Zhang, Huiming Wang

Abstract

Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model's capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty-Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length distribution to closely approximate the optimal length and be as concentrated as possible. Based on these conditions, we propose using the difficulty-level average as a well-founded reference for length optimization. Extensive experiments on both in-domain and out-of-domain benchmarks validate the superiority and effectiveness of DDPO. Compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade-off between accuracy and length. The code is available at https://github.com/Yinan-Xia/DDPO.

Paper Structure (19 sections, 2 theorems, 10 equations, 5 figures, 9 tables)

Key Result

Lemma 1

The expected accuracy $E(f(l))$ can be maximized to $f(l^*)$ when $\mu = l^*$.
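
As a rough illustration of why the condition $\mu = l^*$ matters (a sketch under an added assumption, not the paper's proof), take $\mu$ to be the mean of the length distribution and $\sigma^2$ its variance, and suppose the length-wise accuracy $f$ is approximately quadratic near $l^*$ with some curvature $c > 0$. The quadratic form, $c$, and $\sigma^2$ are assumptions introduced here for illustration only.

```latex
f(l) \approx f(l^*) - c\,(l - l^*)^2
\quad\Longrightarrow\quad
E(f(l)) \approx f(l^*) - c\left[(\mu - l^*)^2 + \sigma^2\right]
```

The second expression follows from $E\left((l - l^*)^2\right) = \sigma^2 + (\mu - l^*)^2$. Under this approximation, the gap to $f(l^*)$ shrinks only as $\mu$ approaches $l^*$ and $\sigma^2$ approaches zero, which is consistent with the abstract's requirement that the length distribution sit close to the optimal length and be as concentrated as possible.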

Figures (5)

  • Figure 1: The relationship between output length and problem difficulty for different LRMs. The output length exhibits a non-monotonic relationship with problem difficulty: it first increases and then decreases as the difficulty decreases. For overly difficult problems, the LRMs generate shorter but incorrect answers, which we define as the overconfidence phenomenon.
  • Figure 2: (a) Output length distribution and accuracy as a function of length on easy tasks before RL training. (b) Length distribution and length-wise accuracy after GRPO (blue) and DDPO (green) training. After training, DDPO yields a left-shifted and more concentrated length distribution compared to GRPO, with samples clustered around the optimal length (corresponding to the highest accuracy), resulting in higher accuracy. (c) Output length distribution and accuracy as a function of length on hard tasks before RL training. (d) Length distribution and length-wise accuracy after GRPO (red) and DDPO (brick red) training. After training, DDPO exhibits a clear rightward shift relative to GRPO, with a more concentrated distribution centered around the optimal length, leading to improved accuracy.
  • Figure 3: (a) We define the rollout accuracy as an indicator of the difficulty for each query. By analyzing the relationship between difficulty and output length (shown in the lower-right panel), we observe an inverted U-shaped pattern, which we define as "overconfidence". Based on this observation, we categorize queries into hard and easy ones and propose Difficulty-Differentiated Policy Optimization (DDPO), which encourages greater exploration for difficult queries while shrinking output length for easy queries to avoid redundancy. (b) When applying the length optimization strategy, we use the optimal length as a reference and guide the final length distribution to be closer to this value and more concentrated, thereby increasing the expected accuracy. (c) We estimate the optimal length as the average length of correct answers for queries of the same difficulty within the batch.
  • Figure 4: The change in the number of samples with an accuracy of 0 and 1 during the training process of GRPO and DDPO. Throughout DDPO training, the number of samples with an accuracy of 0 is consistently lower than that in GRPO, while the number of samples with an accuracy of 1 consistently exceeds that in GRPO.
  • Figure 5: A comparison of the length distributions for different difficulty levels between GRPO and DDPO. The results show that DDPO yields a more concentrated length distribution and significantly alleviates the overconfidence phenomenon.
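
The estimate described in the Figure 3 caption (rollout accuracy as a difficulty indicator, and the average length of correct answers at the same difficulty as the reference length) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `(query_id, length, correct)` tuple layout, and the toy batch are all hypothetical, and the sketch assumes every query has the same number of rollouts so that equal accuracies compare equal as dictionary keys.

```python
from collections import defaultdict


def rollout_accuracy(rollouts):
    """Map each query id to the fraction of its rollouts that are correct."""
    totals = defaultdict(lambda: [0, 0])  # query_id -> [num_correct, num_rollouts]
    for query_id, _length, correct in rollouts:
        totals[query_id][0] += int(correct)
        totals[query_id][1] += 1
    return {q: c / n for q, (c, n) in totals.items()}


def reference_lengths(rollouts):
    """Average length of correct answers, grouped by difficulty level.

    The difficulty level of a rollout is the rollout accuracy of its query
    (hypothetical grouping key chosen for this sketch).
    """
    acc = rollout_accuracy(rollouts)
    sums = defaultdict(lambda: [0, 0])  # level -> [total_length, num_correct]
    for query_id, length, correct in rollouts:
        if correct:
            level = acc[query_id]
            sums[level][0] += length
            sums[level][1] += 1
    # Levels with no correct answer (accuracy 0) get no reference length.
    return {level: total / count for level, (total, count) in sums.items()}


# Toy batch: (query_id, output_length, is_correct), two rollouts per query.
rollouts = [
    ("q1", 100, True), ("q1", 140, True),
    ("q2", 300, True), ("q2", 500, False),
    ("q3", 200, True), ("q3", 260, True),
    ("q4", 80, False), ("q4", 60, False),
]
refs = reference_lengths(rollouts)
print(refs)
```

In this toy batch, `q1` and `q3` share the same difficulty level, so their correct answers are pooled into one reference length, while `q4` has no correct rollout and therefore contributes no reference. How the paper handles such all-incorrect queries is not stated in this excerpt.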

Theorems & Definitions (2)

  • Lemma 1
  • Lemma 2