DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models

Kaiwen Yan, Xuanqing Shi, Hongcheng Guo, Wenxuan Wang, Zhuosheng Zhang, Chengwei Qin

TL;DR

DRQA addresses the overthinking tendency of reasoning large language models (RLLMs) by transferring the resource-competition effect observed in batch inference to single-question inference. It combines supervised fine-tuning on batch-derived data with a reinforcement-learning framework to train models to allocate reasoning resources adaptively, favoring concise and accurate chains of thought. Across math and scientific benchmarks, DRQA reduces token usage by over 30% while maintaining or improving accuracy, and it generalizes to code-generation tasks such as LiveCodeBench. The work highlights a practical, scalable path to more efficient RLLMs through fine-grained control of reasoning depth based on problem difficulty.

Abstract

Reasoning large language models (RLLMs), such as OpenAI-O3 and DeepSeek-R1, have recently demonstrated remarkable capabilities by performing structured and multi-step reasoning. However, recent studies reveal that RLLMs often suffer from overthinking, i.e., producing unnecessarily lengthy reasoning chains even for simple questions, leading to excessive token consumption and computational inefficiency. Interestingly, we observe that when processing multiple questions in batch mode, RLLMs exhibit more resource-efficient behavior by dynamically compressing reasoning steps for easier problems, due to implicit resource competition. Inspired by this, we propose Dynamic Reasoning Quota Allocation (DRQA), a novel method that transfers the benefits of resource competition from batch processing to single-question inference. Specifically, DRQA leverages batch-generated preference data and reinforcement learning to train the model to allocate reasoning resources adaptively. By encouraging the model to internalize a preference for responses that are both accurate and concise, DRQA enables it to generate concise answers for simple questions while retaining sufficient reasoning depth for more challenging ones. Extensive experiments on a wide range of mathematical and scientific reasoning benchmarks demonstrate that DRQA significantly reduces token usage while maintaining, and in many cases improving, answer accuracy. By effectively mitigating the overthinking problem, DRQA offers a promising direction for more efficient and scalable deployment of RLLMs, and we hope it inspires further exploration into fine-grained control of reasoning behaviors.
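The batch-versus-single comparison described above can be illustrated with a minimal sketch. It only shows how several questions might be packed into one prompt and how the resulting token saving could be measured. The prompt template and the function names `build_batch_prompt` and `token_reduction` are assumptions made for illustration and are not taken from the paper.

```python
from typing import List


def build_batch_prompt(questions: List[str]) -> str:
    """Pack several questions into one prompt (hypothetical template)."""
    header = (
        "Answer each of the following questions. "
        "Label each answer with its question number."
    )
    numbered = [f"Q{i}: {q}" for i, q in enumerate(questions, start=1)]
    return "\n".join([header, *numbered])


def token_reduction(single_tokens: List[int], batch_tokens: int) -> float:
    """Fraction of output tokens saved by one batched call.

    single_tokens: output lengths when each question is asked separately.
    batch_tokens: output length of the single batched response.
    """
    total_single = sum(single_tokens)
    if total_single <= 0:
        raise ValueError("single_tokens must sum to a positive number")
    return 1.0 - batch_tokens / total_single
```

As a usage example with made-up numbers, `token_reduction([100, 100, 100], 210)` evaluates to roughly 0.3, meaning the batched response used about 30% fewer tokens than the three separate responses combined.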

Paper Structure

This paper contains 36 sections, 1 equation, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison between batch inference and single-question inference using DeepSeek-R1. Answering three questions together results in significantly fewer tokens than answering each question individually.
  • Figure 2: Impact of batch size on output length and accuracy (DeepSeek-R1).
  • Figure 3: The pipeline of Dynamic Reasoning Quota Allocation (DRQA). Batched questions are input to the LLM, producing reasoning chains labeled as A/B/C. Reinforcement learning trains the model to prefer concise and accurate reasoning for efficient resource allocation.
  • Figure 4: The efficiency-accuracy trade-off on GPQA-diamond for DRQA and ablation variants.
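
The Figure 3 caption mentions reasoning chains labeled A/B/C and a preference for concise, accurate reasoning, but this page does not define the labels. The sketch below is one plausible reading, stated purely as an assumption: A for correct and concise, B for correct but verbose, C for incorrect. The `Chain` type, the `concise_budget` threshold, and the ranking order are all hypothetical and may differ from the paper's actual scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chain:
    text: str          # the reasoning chain produced by the model
    is_correct: bool   # whether its final answer matches the reference
    num_tokens: int    # length of the chain in tokens


def label_chain(chain: Chain, concise_budget: int) -> str:
    """Assign an assumed A/B/C label (not the paper's definition)."""
    if not chain.is_correct:
        return "C"  # assumed: incorrect answer
    if chain.num_tokens <= concise_budget:
        return "A"  # assumed: correct and concise
    return "B"      # assumed: correct but longer than the budget


def rank_by_preference(chains: List[Chain], concise_budget: int) -> List[Chain]:
    """Order chains from most to least preferred: A, then B, then C.

    Ties within a label are broken by shorter length first.
    """
    order = {"A": 0, "B": 1, "C": 2}
    return sorted(
        chains,
        key=lambda c: (order[label_chain(c, concise_budget)], c.num_tokens),
    )
```

Under these assumptions, a ranking like this could supply preference signals of the kind the caption describes, with correct short chains preferred over correct long ones and both preferred over incorrect ones.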