Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty

Zehui Ling, Deshu Chen, Hongwei Zhang, Yifeng Jiao, Xin Guo, Yuan Cheng

TL;DR

This work addresses efficient reasoning in large language models by making outputs concise for easy questions while preserving depth for hard ones. It introduces a Powered Length Penalty (PLP) integrated into reinforcement learning (REINFORCE with RLOO) to couple length with question difficulty in the reward, encouraging brevity on simple problems and full reasoning on challenging ones. Empirical results on GSM8K, MATH500, and AIME2024 show substantial token reductions with maintained or improved accuracy on easier tasks and gains on harder tasks across multiple model variants. The approach offers a practical path to faster, resource-efficient reasoning without sacrificing correctness, with potential applicability beyond math benchmarks.

Abstract

Large language models (LLMs) have made significant advances in reasoning and perform well on a range of challenging benchmarks. Techniques such as Chain-of-Thought prompting further improve reasoning, but they frequently produce longer outputs, which increases computational latency. Some methods use reinforcement learning to shorten reasoning, yet they often apply a uniform penalty regardless of problem complexity, leading to suboptimal outcomes. In this study, we seek to make LLM reasoning more efficient by promoting conciseness on simpler problems while preserving enough reasoning on more complex ones to maintain accuracy, thus improving the model's overall performance. Specifically, we manage the model's reasoning efficiency by dividing the reward function and including a novel penalty for output length. We evaluate on three benchmark datasets: GSM8K, MATH500, and AIME2024. On the comparatively simpler GSM8K and MATH500, our method shortens outputs while preserving or improving accuracy. On the more demanding AIME2024, it improves accuracy.
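To make the training signal described above more concrete, the following is a minimal sketch of a length-penalized reward combined with a leave-one-out (RLOO) baseline. The RLOO baseline itself is standard: each sample's advantage is its reward minus the mean reward of the other samples drawn for the same prompt. The reward shape is an assumption for illustration only. The function `length_penalized_reward` and its parameters `alpha` and `power` are hypothetical and are not the paper's exact formulation; the only detail taken from the source is that incorrect outputs receive 0 reward regardless of length.

```python
from typing import List


def length_penalized_reward(
    correct: bool,
    num_tokens: int,
    alpha: float = 0.001,  # hypothetical penalty coefficient
    power: float = 0.5,    # hypothetical exponent applied to the length
) -> float:
    """Illustrative reward: 0 for a wrong answer, otherwise 1 minus a
    power-law length penalty, floored at 0 so that a correct answer is
    never scored below an incorrect one."""
    if not correct:
        return 0.0
    return max(0.0, 1.0 - alpha * (num_tokens ** power))


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k samples of the same prompt:
    A_i = r_i - mean(r_j for j != i)."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


if __name__ == "__main__":
    # Three sampled answers to one prompt: (is_correct, token_count).
    samples = [(True, 100), (True, 900), (False, 400)]
    rewards = [length_penalized_reward(c, n) for c, n in samples]
    print(rewards)
    print(rloo_advantages(rewards))
```

Under this sketch, the shorter of two correct answers gets the larger advantage, and the advantages within a group always sum to zero. How the penalty is tied to question difficulty is the paper's contribution and is not reproduced here.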

Paper Structure

This paper contains 18 sections, 8 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparison of model outputs under the same prompt. Our method produces a shorter yet accurate response, demonstrating more efficient reasoning.
  • Figure 2: Difference between original RL and ours. Matching colors pair each question with its answer. In our method, the model adapts the length penalty to the difficulty of the problem: a high penalty is imposed on simple tasks to encourage concise responses, whereas the penalty is minimized on complex tasks to permit more comprehensive answers. Incorrect outputs receive 0 reward irrespective of response length.
  • Figure 3: Comparison between standardized and absolute length penalty methods across two example ranges: 300–600 and 7,000–10,000 tokens. Blue indicates the standardized method, while red denotes the absolute method.
  • Figure 4: Difference between big reward and small reward when the last sample is incorrect.
  • Figure 5: Difference between our method and the efficient method. For our method, the coefficients are 1, 2, 3, 4, 5, 20, 30, while for the efficient method, the coefficients are 0.05, 0.1, 0.2, 0.4.
  • ...and 1 more figure
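The contrast named in the Figure 3 caption can be illustrated with a small sketch. The definitions here are assumptions, not the paper's: "standardized" is read as a z-score of the token count within a sampled group, and "absolute" as a penalty that depends on the raw token count. The helper names and the `scale` parameter are hypothetical. The two groups reuse the ranges from the caption (300–600 and 7,000–10,000 tokens).

```python
from statistics import mean, pstdev
from typing import List


def standardized_penalty(lengths: List[int]) -> List[float]:
    """Assumed 'standardized' penalty: z-score of each length within its
    group. It depends only on relative position inside the group."""
    mu = mean(lengths)
    sigma = pstdev(lengths)
    if sigma == 0:
        return [0.0 for _ in lengths]
    return [(n - mu) / sigma for n in lengths]


def absolute_penalty(lengths: List[int], scale: float = 1000.0) -> List[float]:
    """Assumed 'absolute' penalty: grows with the raw token count.
    `scale` is a hypothetical normalizing constant."""
    return [n / scale for n in lengths]


if __name__ == "__main__":
    short_group = [300, 450, 600]       # e.g. an easy question
    long_group = [7000, 8500, 10000]    # e.g. a hard question
    print(standardized_penalty(short_group))
    print(standardized_penalty(long_group))
    print(absolute_penalty(short_group))
    print(absolute_penalty(long_group))
```

Under these assumed definitions, the z-scores of the two groups coincide, so a standardized penalty cannot tell a 300-token answer from a 7,000-token one, while an absolute penalty separates the two ranges.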