Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models

Gaurav Singh, Abhishek Dey, Janit Bidhan, Tanu Kansal, Paras Kath, Saurabh Srivastava

TL;DR

This work investigates batch prompting as an inference-time regularizer for large reasoning models, showing that batching reduces overthinking and reasoning tokens while preserving accuracy across 13 benchmarks and two models. By amortizing fixed prompt costs and limiting per-query reasoning under batch context, the approach achieves about a 74% reduction in reasoning tokens with minimal accuracy loss, and exhibits emergent phenomena like pattern induction and hedging suppression. Explicit prompt-based constraints are shown to be ineffective, highlighting batch prompting as a robust, prompt-only solution. The findings offer a practical, model-agnostic method to improve efficiency in latency- and cost-sensitive reasoning deployments.

Abstract

Recent work has explored batch prompting as a strategy to amortize inference cost in large language models (LLMs). In this paper, we show that batching offers an additional, underappreciated benefit: it regularizes model behavior during multi-step reasoning for Large Reasoning Models (LRMs). We conduct a comprehensive study across 13 diverse benchmarks and observe that batching improves accuracy while substantially reducing reasoning token usage, often by 3x-5x. Through detailed behavioral analysis, we find that batching suppresses overthinking, reduces hedging language (e.g., repetitive self-corrections), and encourages more decisive answers. Surprisingly, we also observe emergent collective effects in batched inference: models often generalize patterns from earlier examples to solve harder ones in the same batch. These findings position batching not just as a throughput optimization, but as a powerful inference-time regularizer for more efficient and reliable LLM reasoning.

Paper Structure

This paper contains 10 sections, 6 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Results for DeepSeek-R1 and OpenAI o1 (averaged over 13 datasets) demonstrate that batching does not hurt reasoning performance while dramatically reducing reasoning tokens. Increasing batch size maintains Average Accuracy within a narrow +2.44% range, yet achieves a 74.2% reduction in Average Reasoning Token Count.
  • Figure 2: Overview of the batch prompting mechanism. A batch of test queries ($Q=[Q_1,Q_2,\ldots,Q_N]$) is prepared as a single prompt and sent to the model ($\mathcal{M}$) for joint response generation. The resulting outputs ($R=[R_1,R_2,\ldots,R_N]$) share a common reasoning context, effectively distributing attention and computation across all items in the batch.
  • Figure 3: Breakdown of total, reasoning, and output token counts for DeepSeek-R1 and OpenAI o1. Increasing batch size consistently reduces total token consumption, mainly through a decline in reasoning tokens, while output token counts remain largely unaffected. This highlights batching as an effective inference-time efficiency strategy.
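
The batching mechanism described in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `build_batch_prompt` and `parse_batch_response`, the `Q[i]` / `A[i]` markers, and the instruction wording are all assumptions made for this example, and the paper's actual prompt template is not shown in this section. The sketch only covers packing $N$ queries into one prompt and splitting the joint response back into per-query answers; the model call itself is left out.

```python
import re
from typing import List, Optional


def build_batch_prompt(queries: List[str]) -> str:
    """Pack N queries into a single prompt (hypothetical Q[i]/A[i] format)."""
    # Shared instructions are written once and amortized over the whole batch.
    header = (
        "Answer each question below. "
        "Reply with one line per question in the form 'A[i]: <answer>'.\n\n"
    )
    body = "\n".join(f"Q[{i}]: {q}" for i, q in enumerate(queries, start=1))
    return header + body


def parse_batch_response(text: str, n: int) -> List[Optional[str]]:
    """Split a joint response into n answers; unanswered slots stay None."""
    answers: List[Optional[str]] = [None] * n
    for match in re.finditer(r"^A\[(\d+)\]:\s*(.*)$", text, flags=re.MULTILINE):
        idx = int(match.group(1))
        if 1 <= idx <= n:  # ignore indices outside the batch
            answers[idx - 1] = match.group(2).strip()
    return answers


if __name__ == "__main__":
    queries = ["What is 2 + 2?", "What is the capital of France?"]
    print(build_batch_prompt(queries))
    # A hand-written stand-in for a model reply, used only to show parsing.
    print(parse_batch_response("A[1]: 4\nA[2]: Paris", len(queries)))
```

Keeping explicit indices in both the prompt and the expected reply makes it possible to detect skipped or reordered answers, which matters more as the batch size grows.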