To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning

Tian Qin, David Alvarez-Melis, Samy Jelassi, Eran Malach

TL;DR

This paper investigates whether sequential backtracking or parallel best-of-$n$ sampling makes better use of test-time compute for LLM reasoning. Using CountDown and Sudoku as controlled benchmarks, it compares backtracking models trained on explicit DFS traces with direct-solution models trained only on correct solutions, evaluating both under fixed compute budgets. It shows that backtracking is not universally superior: it underperforms parallel sampling on CountDown but outperforms it on Sudoku, and its performance is highly sensitive to task structure, model size, and training paradigm. RL fine-tuning with GRPO generally improves backtracking models by enabling them to discover novel search strategies, while direct-solution models may gain one-shot accuracy but lose diversity, which reduces their scalability under parallel search. The results suggest that the choice between sequential and parallel search should account for task depth, data biases, and potential RL benefits, with implications for designing reasoning systems that mix or adapt strategies by problem context.

Abstract

Recent advancements in large language models (LLMs) have significantly improved their reasoning abilities, particularly through techniques involving search and backtracking. Backtracking naturally scales test-time compute by enabling sequential, linearized exploration via long chain-of-thought (CoT) generation. However, this is not the only strategy for scaling test-time compute: parallel sampling with best-of-$n$ selection provides an alternative that generates diverse solutions simultaneously. Despite the growing adoption of sequential search, its advantages over parallel sampling, especially under a fixed compute budget, remain poorly understood. In this paper, we systematically compare these two approaches on two challenging reasoning tasks: CountDown and Sudoku. Surprisingly, we find that sequential search underperforms parallel sampling on CountDown but outperforms it on Sudoku, suggesting that backtracking is not universally beneficial. We identify two factors that can cause backtracking to degrade performance: (1) training on fixed search traces can lock models into suboptimal strategies, and (2) explicit CoT supervision can discourage implicit (non-verbalized) reasoning. Extending our analysis to reinforcement learning (RL), we show that models with backtracking capabilities benefit significantly from RL fine-tuning, while models without backtracking see limited, mixed gains. Together, these findings challenge the assumption that backtracking universally enhances LLM reasoning, instead revealing a complex interaction between task structure, training data, model scale, and learning paradigm.
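
To make "sequential, linearized exploration" concrete, the following is a minimal sketch of a depth-first search over a CountDown instance that records a flat trace of every attempted step and every backtrack. It is illustrative only, not the authors' data-generation code. The function name `countdown_dfs`, the trace wording, the operator order, and the rules assumed here (all numbers must be used, subtraction is taken as larger minus smaller, division must be exact) are assumptions made for this example.

```python
from itertools import combinations


def countdown_dfs(numbers, target):
    """Depth-first search on a CountDown instance (illustrative sketch).

    Assumed rules: combine two numbers at a time with +, *, -, / until one
    number remains; that number must equal the target.
    Returns (solution_steps or None, linearized_trace, wrong_terminals).
    """
    trace = []            # flat, in-order log of the search
    wrong_terminals = 0   # terminal nodes visited that missed the target

    def candidates(a, b):
        # Assumed operator order; subtraction and division use (larger, smaller).
        hi, lo = max(a, b), min(a, b)
        yield hi + lo, f"{hi}+{lo}"
        yield hi * lo, f"{hi}*{lo}"
        yield hi - lo, f"{hi}-{lo}"
        if lo != 0 and hi % lo == 0:
            yield hi // lo, f"{hi}/{lo}"

    def search(nums, steps):
        nonlocal wrong_terminals
        if len(nums) == 1:
            if nums[0] == target:
                trace.append(f"reached {target}: success")
                return steps
            wrong_terminals += 1
            trace.append(f"reached {nums[0]} != {target}: backtrack")
            return None
        for i, j in combinations(range(len(nums)), 2):
            rest = [n for k, n in enumerate(nums) if k not in (i, j)]
            for value, expr in candidates(nums[i], nums[j]):
                step = f"{expr}={value}"
                trace.append(f"try {step}")
                found = search(rest + [value], steps + [step])
                if found is not None:
                    return found
        return None

    solution = search(list(numbers), [])
    return solution, trace, wrong_terminals
```

Under these assumptions, `countdown_dfs([1, 2], 2)` first tries `2+1=3`, backtracks, and then succeeds with `2*1=2`, so it reports one wrong terminal node. A backtracking model in the paper's sense is trained on traces of this kind, whereas a direct-solution model would see only the final list of correct steps.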

Paper Structure

This paper contains 73 sections, 4 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Backtracking performance varies significantly with task type and the application of post-training reinforcement learning. (A) Training backtracking and direct solution models on CountDown and Sudoku reveals task-dependent performance: under equal test-time compute, backtracking (sequential search) underperforms direct solution with best-of-$n$ generation (parallel search) on CountDown, but outperforms it on Sudoku. (B) Fine-tuning with GRPO consistently improves backtracking model performance across compute budgets, but has mixed effects on the direct solution model.
  • Figure 2: Backtracking and direct solution for two different strategic games. Panel (a, b): Example search trees for CountDown and Sudoku. Solving both games requires extensive search in the solution space. Panel (c): The backtracking model is trained on the search traces generated by a Depth-First-Search (DFS) algorithm. At test time, the model performs sequential search. The direct solution model is trained on the correct solution only. At test time, the model performs parallel search through temperature sampling and takes best-of-$n$.
  • Figure 3: Backtracking and direct solution models implement different search strategies for CountDown. For test questions that the model solves correctly, we measure the number of mistakes made (i.e., wrong terminal nodes visited) before finding the correct solution. We sort the test questions by the number of mistakes made by DFS. Left: Trained on DFS traces, the backtracking model makes a number of mistakes that correlates with that of DFS. Middle: In contrast, the direct solution model solves many more problems with significantly fewer mistakes than DFS. Right: For a given number of mistakes made, we examine whether the two models solve the same set of questions as DFS. The direct solution model implements a search strategy significantly different from DFS.
  • Figure 4: Two variations that improve the backtracking model. (a) We hypothesize that the backtracking model can think one step ahead without sacrificing its ability to search. Therefore, we shorten the search trace by skipping the last search step. (b) Two data variations that improve the backtracking model. The mixed-backtrack model is trained on a diverse set of search strategies. The think-backtracking model is trained on shortened DFS traces.
  • Figure 5: Different scaling behaviors for backtracking versus direct solution models. CountDown: (A) Backtracking model performance does not improve as we scale up model size. (B) The direct solution model improves. (C) The direct solution model consistently outperforms the backtracking model. Sudoku: (D, E) Both models' performance improves as we scale up model size. (F) The direct solution model consistently underperforms the backtracking model.
  • ...and 10 more figures
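
The parallel search described in the Figure 2 caption (temperature sampling followed by best-of-$n$) can be sketched roughly as below. This is a hedged illustration rather than the paper's evaluation code: `best_of_n`, `sample`, and `is_correct` are hypothetical names, and selecting the first sample accepted by a task verifier is an assumption about how the best candidate is chosen.

```python
import random
from typing import Callable, Optional


def best_of_n(
    sample: Callable[[random.Random], str],   # hypothetical: draws one candidate solution
    is_correct: Callable[[str], bool],        # hypothetical: task verifier
    n: int,
    seed: int = 0,
) -> Optional[str]:
    """Draw up to n independent candidates and return the first verified one.

    Returns None if none of the n samples passes the verifier.
    """
    rng = random.Random(seed)
    for _ in range(n):
        candidate = sample(rng)
        if is_correct(candidate):
            return candidate
    return None


# Toy usage: a stand-in "model" that guesses one of a few expressions for target 2.
guesses = ["2+1", "2*1", "2-1"]
answer = best_of_n(
    sample=lambda rng: rng.choice(guesses),
    is_correct=lambda expr: eval(expr) == 2,  # eval is acceptable only for this toy input
    n=16,
)
```

The samples are independent, so the loop could run in parallel; increasing `n` is the parallel counterpart to letting a backtracking model generate a longer search trace.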