BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang
TL;DR
BoostStep targets a bottleneck in the mathematical reasoning of large language models: errors inside individual reasoning steps. It moves in-context guidance from the problem level to the step level, aligning each retrieved example with the step currently being reasoned about. To do so, it introduces a step-level example bank and a "first-try" retrieval strategy that selects exemplars highly relevant to the current step, improving per-step correctness and reducing distraction from irrelevant steps. The approach integrates with chain-of-thought and step-level tree search, yields substantial gains with GPT-4o and on multi-modal math benchmarks, and remains robust when problems and exemplars have low similarity. Overall, BoostStep offers a practical, flexible way to enhance the mathematical capabilities of LLMs using simpler, targeted exemplars and established reasoning frameworks.
Abstract
Large language models (LLMs) have demonstrated impressive ability in solving complex mathematical problems with multi-step reasoning and can be further enhanced with well-designed in-context learning (ICL) examples. However, this potential is often constrained by two major challenges in ICL: granularity mismatch and irrelevant information. We observe that while LLMs excel at decomposing mathematical problems, they often make reasoning errors in fine-grained steps. Moreover, ICL examples retrieved at the question level may omit critical steps or even mislead the model with irrelevant details. To address these issues, we propose BoostStep, a method that enhances reasoning accuracy through step-aligned ICL, a novel mechanism that carefully aligns retrieved reference steps with the corresponding reasoning steps. Additionally, BoostStep incorporates an effective "first-try" strategy to deliver exemplars highly relevant to the current state of reasoning. BoostStep is a flexible and powerful method that integrates seamlessly with chain-of-thought (CoT) and tree search algorithms, refining both candidate selection and decision-making. Empirical results show that BoostStep improves GPT-4o's CoT performance by 4.6% across mathematical benchmarks, well above the 1.2% gain from traditional few-shot learning. Moreover, it achieves an additional 7.5% gain when combined with tree search. Surprisingly, it enables state-of-the-art LLMs to solve challenging math problems using simpler examples: it improves DeepSeek-R1-671B's performance on AIME by 2.2% while drawing only on simple examples from the MATH dataset.
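The step-aligned loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions: the helper names (`jaccard`, `retrieve_step`, `solve`) are hypothetical, token-level Jaccard overlap stands in for whatever similarity measure the method actually uses, the similarity threshold is an illustrative rejection rule, and `generate_step` is a placeholder for an LLM call that returns the next reasoning step (or `None` when the solution is complete).

```python
from typing import Callable, List, Optional

# Hypothetical signature for an LLM call: (question, steps so far, optional
# reference step) -> next step text, or None when reasoning is finished.
StepGenerator = Callable[[str, List[str], Optional[str]], Optional[str]]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap; a stand-in similarity (assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_step(first_try: str, bank: List[str], threshold: float) -> Optional[str]:
    """Return the bank step most similar to the tentative step, or None
    if nothing in the bank is similar enough to be useful guidance."""
    best = max(bank, key=lambda s: jaccard(first_try, s), default=None)
    if best is None or jaccard(first_try, best) < threshold:
        return None
    return best


def solve(question: str, generate_step: StepGenerator, bank: List[str],
          threshold: float = 0.3, max_steps: int = 8) -> List[str]:
    """Step-aligned loop: draft each step, retrieve by the draft, then redo
    the step with the retrieved example only when one is relevant."""
    steps: List[str] = []
    for _ in range(max_steps):
        # "First try": a tentative step produced without any example.
        first_try = generate_step(question, steps, None)
        if first_try is None:
            break
        example = retrieve_step(first_try, bank, threshold)
        if example is None:
            # No sufficiently similar example: keep the unguided step.
            step = first_try
        else:
            # Regenerate the same step with step-level guidance.
            step = generate_step(question, steps, example) or first_try
        steps.append(step)
    return steps
```

The point of the sketch is the ordering: the retrieval query is the model's own tentative step rather than the original question, so the retrieved example is matched to the current state of reasoning, and an example that falls below the threshold is dropped instead of being allowed to distract the model.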
