BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang

TL;DR

BoostStep addresses the bottleneck of mathematical reasoning in large language models by shifting in-context guidance from the problem level to the step level, aligned with the step currently being reasoned about. It introduces a step-level example bank and a first-try retrieval strategy to select highly relevant exemplars for the current step, improving per-step correctness and reducing distraction from irrelevant steps. The approach integrates with chain-of-thought and step-level tree search, delivering substantial gains with GPT-4o on text-only and multi-modal math benchmarks, and remains robust when the similarity between problems and exemplars is low. Overall, BoostStep offers a practical, flexible way to enhance the mathematical capabilities of LLMs using simpler, targeted exemplars and established reasoning frameworks.

Abstract

Large language models (LLMs) have demonstrated impressive ability in solving complex mathematical problems with multi-step reasoning and can be further enhanced with well-designed in-context learning (ICL) examples. However, this potential is often constrained by two major challenges in ICL: granularity mismatch and irrelevant information. We observe that while LLMs excel at decomposing mathematical problems, they often struggle with reasoning errors in fine-grained steps. Moreover, ICL examples retrieved at the question level may omit critical steps or even mislead the model with irrelevant details. To address these issues, we propose BoostStep, a method that enhances reasoning accuracy through step-aligned ICL, a novel mechanism that carefully aligns retrieved reference steps with the corresponding reasoning steps. Additionally, BoostStep incorporates an effective "first-try" strategy to deliver exemplars highly relevant to the current state of reasoning. BoostStep is a flexible and powerful method that integrates seamlessly with chain-of-thought (CoT) and tree search algorithms, refining both candidate selection and decision-making. Empirical results show that BoostStep improves GPT-4o's CoT performance by 4.6% across mathematical benchmarks, significantly surpassing traditional few-shot learning's 1.2%. Moreover, it can achieve an additional 7.5% gain when combined with tree search. Surprisingly, it enhances state-of-the-art LLMs to solve challenging math problems using simpler examples. It improves DeepSeek-R1-671B's performance on AIME by 2.2%, leveraging simple examples only from the MATH dataset.
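
The "first-try" strategy described in the abstract can be pictured as a small loop: draft a tentative step without any example, use that draft to retrieve a similar step from the example bank, then regenerate the step with the retrieved example as guidance. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `StepExample`, `retrieve_for_step`, and `solve_step_by_step` are hypothetical, the token-overlap (Jaccard) similarity is a stand-in for whatever retriever is actually used, and keeping the first try when no example clears a similarity threshold is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepExample:
    """One entry of a step-level example bank (hypothetical layout)."""
    question: str
    previous_steps: List[str]
    step: str


def jaccard(a: str, b: str) -> float:
    """Toy similarity: token-set overlap. A stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_for_step(first_try: str, bank: List[StepExample],
                      threshold: float) -> Optional[StepExample]:
    """Return the bank entry most similar to the tentative step, or None."""
    best = max(bank, key=lambda ex: jaccard(first_try, ex.step), default=None)
    if best is None or jaccard(first_try, best.step) < threshold:
        return None  # assumption: better no example than an irrelevant one
    return best


# A step generator takes (problem, steps so far, optional example) and
# returns the next step. In practice this would wrap an LLM call.
StepGenerator = Callable[[str, List[str], Optional[StepExample]], str]


def solve_step_by_step(problem: str, bank: List[StepExample],
                       generate_step: StepGenerator,
                       threshold: float = 0.3, max_steps: int = 8,
                       stop_marker: str = "ANSWER") -> List[str]:
    """Step-aligned ICL loop with a first-try retrieval pass per step."""
    steps: List[str] = []
    for _ in range(max_steps):
        # 1) First try: draft the next step with no example.
        first_try = generate_step(problem, steps, None)
        # 2) Use the draft as the query for step-level retrieval.
        example = retrieve_for_step(first_try, bank, threshold)
        # 3) Regenerate with guidance if a relevant example was found.
        step = first_try if example is None else generate_step(problem, steps, example)
        steps.append(step)
        if stop_marker in step:
            break
    return steps
```

The point of the design is that the retrieval query is the model's own tentative step rather than the original question, so the retrieved exemplar is tied to the current state of reasoning.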

Paper Structure (21 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Our step-aligned in-context learning (ICL) outperforms traditional problem-level few-shot learning by about 4% across in-domain, out-of-domain, and cross-modality mathematical benchmarks on GPT-4o. Moreover, on benchmarks with lower similarity to the reference problem set (i.e., OlympiadBench and the multi-modal benchmarks), where problem-level ICL may have a negative impact, BoostStep still provides valuable guidance.
  • Figure 2: Our strategy refines in-context learning from problem-level granularity (Fig. a) to step-level granularity (Fig. b) to provide more fine-grained, real-time guidance. Moreover, by introducing examples, it can guide both the reasoning and the verification process in tree-search strategies.
  • Figure 3: Different problems may contain similar steps. Problem-level in-context learning would ignore such an example because the problems themselves have low similarity. In contrast, our step-level in-context learning strategy can introduce the core skills through step-level retrieval and guidance.
  • Figure 4: A specific example of adjusting reasoning in real time during inference through step-level in-context learning. The first try uses a wrong equation, while the retrieved example step guides the model to use the correct equation and reach the correct conclusion.
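
The contrast drawn in the Figure 3 caption, where two problems look unrelated yet share a step, can be made concrete with a toy comparison. The sketch below is a hedged illustration only: the helper names (`token_overlap`, `build_step_bank`, `best_problem`, `best_step`), the sample data, and the token-overlap similarity are all assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def token_overlap(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of lowercase token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def build_step_bank(solved: Dict[str, List[str]]) -> List[Tuple[str, int, str]]:
    """Flatten solved problems into (question, step index, step text) entries."""
    return [(q, i, s) for q, steps in solved.items() for i, s in enumerate(steps)]


def best_problem(query_question: str, solved: Dict[str, List[str]]) -> Tuple[float, str]:
    """Problem-level retrieval: score whole questions against the query question."""
    return max((token_overlap(query_question, q), q) for q in solved)


def best_step(query_step: str, bank: List[Tuple[str, int, str]]) -> Tuple[float, str, int]:
    """Step-level retrieval: score individual steps against the current step."""
    return max((token_overlap(query_step, s), q, i) for q, i, s in bank)


# Hypothetical data: the questions share almost no wording,
# but the first reference step matches the step being attempted.
solved = {
    "A ladder leans against a wall . How high does it reach ?": [
        "apply the pythagorean theorem to the right triangle",
        "take the square root of the result",
    ],
}
query_question = "Find the diagonal of a rectangular screen"
query_step = "apply the pythagorean theorem to the two sides"

problem_score, _ = best_problem(query_question, solved)
step_score, _, step_index = best_step(query_step, build_step_bank(solved))
print(f"problem-level similarity: {problem_score:.3f}")
print(f"step-level similarity:    {step_score:.3f} (reference step {step_index})")
```

With this toy data, the question-level score is far lower than the step-level score, so a problem-level retriever with any reasonable cutoff would discard the ladder problem while a step-level retriever would still surface its first step.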