Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG

Zhichao Wang, Cheng Wan, Dong Nie

TL;DR

This paper surveys the shift from traditional pre-training scaling to inference-time scaling for large language models, a shift driven by the shrinking supply of high-quality training data. It organizes techniques into two main pillars: output-focused methods (reasoning, search, decoding, long CoT training, multi-modal reasoning, and model ensembles) and input-focused methods (RAG and few-shot prompting). The review details how these methods spend additional computation at inference time to improve task performance without retraining, either through structured reasoning paths, diverse decoding, and external tools, or through retrieval-augmented inputs and memory-driven prompts. Key contributions include a structured taxonomy of inference-time techniques, a synthesis of advances across single-model, multi-model, and multi-modal settings, and insights into practical deployment considerations such as data chunking, graph-based representations, and collaborative RAG. The work highlights the practical impact of inference-time scaling on efficiency and accuracy and outlines directions for future work on scalable, responsible, and multi-modal LLM systems.

Abstract

The performance gains of LLMs have historically been driven by scaling up model size and training data. However, the rapidly diminishing availability of high-quality training data is introducing a fundamental bottleneck, shifting the focus of research toward inference-time scaling. This paradigm uses additional computation at deployment time to substantially improve LLM performance on downstream tasks without costly model re-training. This review systematically surveys the diverse techniques contributing to this new era of inference-time scaling, organizing the rapidly evolving field into two perspectives: output-focused and input-focused methods. Output-focused techniques encompass complex, multi-step generation strategies, including reasoning (e.g., CoT, ToT, ReAct), various search and decoding methods (e.g., MCTS, beam search), training for long CoT (e.g., RLVR, GRPO), and model ensemble methods. Input-focused techniques fall mainly into few-shot prompting and RAG, with RAG as the central focus. The RAG section is examined in turn through query expansion, data, retrieval and reranker, LLM generation methods, and multi-modal RAG.

Paper Structure

This paper contains 50 sections, 9 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: Overview of the techniques discussed. On the output side: 1) reasoning methods such as CoT, ToT, and ReAct; 2) search methods such as MCTS and beam search; 3) decoding methods such as Best-of-N, speculative decoding, and constrained decoding; 4) training for long CoT, such as RLVR and GRPO; 5) multi-modal reasoning; and 6) model ensembles. The input side is divided into RAG and few-shot. RAG is discussed from the perspectives of: 1) query expansion, 2) data, 3) retrieval and reranker, 4) LLM generation, and 5) multi-modal RAG.
  • Figure 2: CoT: the LLM is prompted with few-shot examples, each consisting of a prompt, a chain of thought, and an answer, so that it generates a chain of thought before its final answer. In the zero-shot case, the phrase "Let’s think step by step" is used instead to encourage the LLM to reason before answering.
  • Figure 3: Comparison of (1) CoT (chain of thought), (2) SC (self-consistency), (3) ToT (tree of thoughts), and (4) GoT (graph of thoughts).
  • Figure 4: (a) Self-Refine: the same LLM generates the response and provides the feedback. (b) Reflexion: iterative reasoning with memory.
  • Figure 5: A program-aided language model translates the natural-language problem into a Python program, and the result is obtained by executing that program.
  • ...and 16 more figures
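
The prompting schemes named in the captions for Figures 2 and 3 can be illustrated with a minimal sketch. It assumes a simple "Q:/A:" prompt layout and treats self-consistency as a majority vote over the final answers of several sampled reasoning paths. The function names, the prompt layout, and the "The answer is" suffix are hypothetical choices made for illustration and are not taken from the paper. No model is called here: the sampled answers are passed in as plain strings.

```python
from collections import Counter

# Zero-shot trigger phrase quoted in the Figure 2 caption.
ZERO_SHOT_TRIGGER = "Let's think step by step."


def build_cot_prompt(question, examples=None):
    """Build a CoT prompt (hypothetical layout, for illustration only).

    examples: optional list of (question, chain_of_thought, answer) tuples.
    With examples, a few-shot prompt is built; without them, the zero-shot
    trigger phrase is appended instead.
    """
    if not examples:
        return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"
    blocks = [
        f"Q: {q}\nA: {cot} The answer is {answer}."
        for q, cot, answer in examples
    ]
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def majority_vote(answers):
    """Self-consistency style aggregation: return the most frequent answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]
```

In use, `build_cot_prompt` would be sent to a model several times with sampling enabled, the final answer would be parsed from each completion, and `majority_vote` would pick the answer that most reasoning paths agree on. How answers are parsed and how ties are broken are left open in this sketch.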