Test-time Scaling of LLMs: A Survey from A Subproblem Structure Perspective

Zhuoyi Yang, Xu Guo, Tong Zhang, Huijuan Xu, Boyang Li

TL;DR

The paper addresses how to improve LLM/VLM inference accuracy by spending more compute at test time without altering model weights. It introduces a unifying perspective based on subproblem structure (sequential, parallel, or tree) to categorize test-time scaling (TTS) methods, including Chain-of-Thought (CoT), Tree-of-Thought (ToT), and related hybrids. The authors survey task decomposition strategies (human-only and LLM-assisted) and analyze reasoning-path approaches across unimodal, multimodal, and retrieval-augmented generation (RAG) settings, detailing strengths, weaknesses, and mitigation techniques. They conclude with promising directions (meta-reasoning, efficient multi-path exploration, and extending tree-based reasoning to multimodal and RAG systems) that could yield more efficient, robust, and generalizable reasoning systems.

Abstract

In this paper, we survey techniques for improving the predictive accuracy of pretrained large language models by allocating additional compute at inference time. In categorizing test-time scaling methods, we place special emphasis on how a problem is decomposed into subproblems and on the topological organization of these subproblems, whether sequential, parallel, or tree-structured. This perspective allows us to unify diverse approaches such as Chain-of-Thought, Branch-Solve-Merge, and Tree-of-Thought under a common lens. We further synthesize existing analyses of these techniques, highlighting their respective strengths and weaknesses, and conclude by outlining promising directions for future research.

Paper Structure

This paper contains 15 sections and 2 figures.

Figures (2)

  • Figure 1: Test-time scaling methods categorized by reasoning path and decomposition method.
  • Figure 2: An illustration of the three subtask structures. In the sequential structure (a), there may be branching decision points, but we visit only one path leading to the answer, without backtracking. In the parallel structure (b), we follow several paths simultaneously and aggregate the answers from all paths. In the tree structure (c), we may dynamically decide to jump from one path to another, but eventually find only one path to the answer.
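
The three structures described for Figure 2 can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the tiny graph `CHILDREN`, the leaf table `LEAF_ANSWER`, the fixed choice rules standing in for an LLM's decision at each step, and the use of majority voting as the aggregation rule. It is meant only to show how the three traversal patterns differ.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical toy problem: nodes are partial solutions, edges are reasoning steps.
CHILDREN = {
    "start": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}
# Answer reached at each leaf; None marks a dead end (no usable answer).
LEAF_ANSWER = {"a1": None, "a2": "42", "b1": "42"}

Chooser = Callable[[str, Sequence[str]], str]


def pick_first(node: str, options: Sequence[str]) -> str:
    """Stand-in for a model's choice: always take the first option."""
    return options[0]


def pick_last(node: str, options: Sequence[str]) -> str:
    """Stand-in for a model's choice: always take the last option."""
    return options[-1]


def sequential(chooser: Chooser, node: str = "start") -> Tuple[List[str], Optional[str]]:
    """(a) Commit to one option at every decision point; never backtrack."""
    path = [node]
    while CHILDREN[node]:
        node = chooser(node, CHILDREN[node])
        path.append(node)
    return path, LEAF_ANSWER[node]


def parallel(choosers: Sequence[Chooser]) -> Optional[str]:
    """(b) Follow several independent paths, then aggregate by majority vote."""
    answers = [sequential(chooser)[1] for chooser in choosers]
    votes = Counter(answer for answer in answers if answer is not None)
    return votes.most_common(1)[0][0] if votes else None


def tree(node: str = "start") -> Optional[Tuple[List[str], str]]:
    """(c) Depth-first search: leave a dead-end path for a sibling, return one path."""
    if not CHILDREN[node]:
        answer = LEAF_ANSWER[node]
        return ([node], answer) if answer is not None else None
    for child in CHILDREN[node]:
        found = tree(child)
        if found is not None:
            path, answer = found
            return [node] + path, answer
    return None
```

In this toy graph, `sequential(pick_first)` commits to the path through `a1` and ends at a dead end, whereas `tree()` abandons `a1` and recovers through its sibling `a2`. `parallel([pick_first, pick_last])` runs both sequential paths and keeps the answer that the surviving paths agree on.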