Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models
Xiao-Wen Yang, Xuan-Yi Zhu, Wen-Da Wei, Ding-Chu Zhang, Jie-Jing Shao, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li
TL;DR
The paper tackles inefficiencies in slow-thinking LLMs by proposing Self-Backtracking, a framework that internalizes the search process: the model learns when and where to backtrack during training and applies that ability at inference time to perform dynamic, memory-efficient reasoning. It introduces a dual-dataset training regime (D_op and D_back) and a backtracking-driven inference loop consisting of Expansion, Backtracking, and Selection, complemented by an expert-iteration-based self-improvement loop. Empirical results on the Countdown task show gains of over 40% compared with optimal-path supervised fine-tuning (SFT); the method benefits from both the backtracking budget and test-time scaling, and expert iteration yields sustained improvement. The work suggests a promising direction for advancing LLM reasoning toward more robust, self-improving Reasoners, and it outlines limitations and future work on broader generalization and scaling.
Abstract
The integration of slow-thinking mechanisms into large language models (LLMs) offers a promising path toward achieving Level 2 AGI Reasoners, as exemplified by systems like OpenAI's o1. However, several significant challenges remain, including inefficient overthinking and an overreliance on auxiliary reward models. We point out that these limitations stem from LLMs' inability to internalize the search process, a key component of effective reasoning. A critical step toward addressing this issue is enabling LLMs to autonomously determine when and where to backtrack, a fundamental operation in traditional search algorithms. To this end, we propose a self-backtracking mechanism that equips LLMs with the ability to backtrack during both training and inference. This mechanism enhances not only reasoning ability but also efficiency, by transforming slow-thinking processes into fast-thinking ones through self-improvement. Empirical evaluations demonstrate that our proposal significantly enhances the reasoning capabilities of LLMs, achieving a performance gain of over 40 percent compared to the optimal-path supervised fine-tuning method. We believe this study introduces a novel and promising pathway for developing more advanced and robust Reasoners.
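The abstract describes backtracking as a fundamental operation in traditional search algorithms. For readers unfamiliar with it, the following is a minimal sketch of classical depth-first search with backtracking on a Countdown-style puzzle. It is not the paper's learned self-backtracking mechanism: the function name solve_countdown, the step format, and the puzzle rules assumed here (positive integers, every number used exactly once, only exact division) are illustrative assumptions.

```python
from itertools import combinations


def solve_countdown(numbers, target):
    """Return a list of steps that reaches `target`, or None if none exists.

    Classical depth-first search; assumes positive integer inputs and that
    every number must be used exactly once (an assumed rule set).
    """
    path = []  # steps on the path currently being explored

    def search(pool):
        # Goal test: all numbers combined into a single remaining value.
        if len(pool) == 1:
            return pool[0] == target
        for i, j in combinations(range(len(pool)), 2):
            a, b = max(pool[i], pool[j]), min(pool[i], pool[j])
            rest = [x for k, x in enumerate(pool) if k not in (i, j)]
            candidates = [(a + b, "+"), (a - b, "-"), (a * b, "*")]
            if b != 0 and a % b == 0:  # allow only exact division
                candidates.append((a // b, "/"))
            for value, op in candidates:
                path.append(f"{a} {op} {b} = {value}")  # expand one step
                if search(rest + [value]):
                    return True
                path.pop()  # dead end: backtrack to the previous state
        return False

    return list(path) if search(list(numbers)) else None
```

Calling solve_countdown([2, 3, 4], 14) returns a two-step solution, while solve_countdown([1, 1], 5) returns None because no combination reaches the target. Here the decision to backtrack is hard-coded into the search procedure; the paper's proposal is instead for the model itself to learn when and where to backtrack.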
