Plan and Budget: Effective and Efficient Test-Time Scaling on Reasoning Large Language Models

Junhong Lin, Xinyue Zeng, Jie Zhu, Song Wang, Julian Shun, Jun Wu, Dawei Zhou

TL;DR

Plan-and-Budget is a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling, improving reasoning efficiency across a range of tasks and models.

Abstract

Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent work has tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BAM (Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the E3 metric to capture the trade-off between correctness and computational efficiency. Building on theoretical results from BAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to 70% accuracy gains, 39% token reduction, and 193.8% improvement in E3. Notably, it improves the efficiency of a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at https://github.com/junhongmit/P-and-B.

Paper Structure

This paper contains 31 sections, 35 equations, 4 figures, 15 tables.

Figures (4)

  • Figure 1: Illustration of Reasoning Miscalibration. Vanilla reasoning results in overthinking and wastes tokens; global budgeting results in underthinking and fails. Our method combines planning and local budgeting to guide structured, efficient reasoning, achieving the correct answer with fewer tokens.
  • Figure 2: Visualization of decay functions $\rho$. The example uses $B=100$, $p=2$, $\gamma=0.9$, and 5 sub-questions of equal complexity.
  • Figure 3: Answer pass rate (%) grouped by the question difficulty level, with legend showing the overall pass rate (%) and average token usage. The global budget limit hurts the pass rate on all levels, while our method not only achieves a higher pass rate but also enjoys lower token usage.
  • Figure 4: Token usage and pass rate analysis across difficulty levels on TravelPlanner. (Left) Token usage distributions. (Right) Answer pass rate (%) by difficulty level.
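
To make the budgeting idea behind Figure 2 concrete, the following is a minimal sketch of decay-weighted token allocation, not the authors' implementation. It assumes that each sub-question receives a share of the total budget $B$ proportional to its estimated complexity multiplied by a position-dependent decay weight. The function names (`allocate_budgets`, `uniform`, `polynomial`, `exponential`) and the specific decay forms $(i+1)^{-p}$ and $\gamma^{i}$ are hypothetical choices made for illustration; the paper's exact definitions of $\rho$ are not reproduced on this page.

```python
from typing import Callable, List


def allocate_budgets(
    complexities: List[float],
    total_budget: int,
    decay: Callable[[int], float],
) -> List[int]:
    """Split total_budget across sub-questions (illustrative sketch).

    Assumption: sub-question i gets a share proportional to
    complexities[i] * decay(i). Integer budgets are produced with
    largest-remainder rounding so they sum exactly to total_budget.
    """
    weights = [c * decay(i) for i, c in enumerate(complexities)]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    raw = [total_budget * w / total_weight for w in weights]
    budgets = [int(r) for r in raw]  # floor of each non-negative share
    leftover = total_budget - sum(budgets)

    # Hand the leftover tokens to the largest fractional remainders.
    order = sorted(
        range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True
    )
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Hypothetical decay schedules (assumed forms, not taken from the paper).
def uniform(i: int) -> float:
    return 1.0


def polynomial(p: float) -> Callable[[int], float]:
    return lambda i: (i + 1) ** (-p)


def exponential(gamma: float) -> Callable[[int], float]:
    return lambda i: gamma ** i


# Setting from the Figure 2 caption: B=100, p=2, gamma=0.9,
# and 5 sub-questions with the same complexity.
equal_complexity = [1.0] * 5
uniform_budgets = allocate_budgets(equal_complexity, 100, uniform)
poly_budgets = allocate_budgets(equal_complexity, 100, polynomial(2))
exp_budgets = allocate_budgets(equal_complexity, 100, exponential(0.9))
```

Under these assumptions, the uniform schedule gives every sub-question the same budget, while the two decaying schedules give earlier sub-questions larger budgets than later ones, with the polynomial form falling off more steeply than the exponential one for these parameter values.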