
Expanding Search Space with Diverse Prompting Agents: An Efficient Sampling Approach for LLM Mathematical Reasoning

Gisang Lee, Sangwoo Park, Junyoung Park, Andrew Chung, Sieun Park, Yoonah Park, Byungju Kim, Min-gyu Cho

TL;DR

This study performs an experimental analysis of distinct prompting methods within the domain of mathematical reasoning to demonstrate that each method explores a distinct search space, and this differentiation becomes more evident with increasing problem complexity.

Abstract

Large Language Models (LLMs) have exhibited remarkable capabilities in many complex tasks, including mathematical reasoning. However, traditional approaches rely heavily on ensuring self-consistency within a single prompting method, which limits the exploration of diverse problem-solving strategies. This study addresses these limitations by performing an experimental analysis of distinct prompting methods within the domain of mathematical reasoning. Our findings demonstrate that each method explores a distinct search space, and this differentiation becomes more evident with increasing problem complexity. To leverage this phenomenon, we applied an efficient sampling process that uniformly combines samples from these diverse methods, which not only expands the maximum search space but also achieves higher performance with fewer runs than single methods. In particular, on MATH-hard, the subset of difficult questions from the MATH dataset, the maximum search space was reached with approximately 43% fewer runs than single methods on average. These findings highlight the importance of integrating diverse problem-solving strategies to enhance the reasoning abilities of LLMs.
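The idea of uniformly combining runs from several prompting methods can be illustrated with a minimal sketch. It assumes that "maximum search space" means the fraction of problems solved by at least one sampled run, and that uniform combination means taking runs from the methods in round-robin order; the function names and the toy data are hypothetical and not taken from the paper.

```python
from itertools import zip_longest


def max_search_space(runs):
    """Fraction of problems solved by at least one run.

    runs[r][q] is True if run r answered problem q correctly.
    (Assumed definition; the paper may define it differently.)
    """
    if not runs:
        return 0.0
    n_problems = len(runs[0])
    solved = [any(run[q] for run in runs) for q in range(n_problems)]
    return sum(solved) / n_problems


def uniform_sample(runs_by_method, budget):
    """Pick `budget` runs, cycling through the methods in round-robin order."""
    interleaved = [
        run
        for group in zip_longest(*runs_by_method.values())
        for run in group
        if run is not None  # skip padding when methods have unequal run counts
    ]
    return interleaved[:budget]


# Toy data: 3 runs per method on 4 problems (hypothetical, not from the paper).
T, F = True, False
runs_by_method = {
    "text": [[T, F, F, F], [T, F, F, F], [T, T, F, F]],
    "code": [[F, F, T, F], [F, F, T, F], [F, F, T, F]],
    "cr":   [[F, F, F, T], [T, F, F, T], [F, F, F, T]],
}

single = max_search_space(runs_by_method["text"])          # 3 runs, one method
mixed = max_search_space(uniform_sample(runs_by_method, 3))  # 3 runs, mixed
print(single, mixed)
```

In this toy setting, three runs from a single method cover half of the problems, while three runs spread across the methods cover three quarters, which mirrors the qualitative claim that mixing methods reaches a given coverage with fewer runs.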

Paper Structure

This paper contains 18 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Line graph of the maximum search space accuracy achieved by sampling 21 runs per method. The three grey horizontal lines mark the upper bound reached by each single method. The star markers indicate the points at which our proposed Uniform Sampling method reaches those upper bounds. For text, code, and CR, the same upper bound was reached with approximately 48%, 45%, and 35% fewer runs, respectively.
  • Figure 2: Maximum search space of each method on MATH-hard (* 280 test subset). In the Venn diagram, $(B \cup C) \setminus A$ represents the proportion of the search space that method A fails to explore.
  • Figure 3: Maximum search space of each method on MATH-hard (* 280 test subset): Radar chart showing the average accuracy in each of the 7 domains for each method (Text, Code, CR), based on 21 runs per method.
  • Figure 4: Maximum search space of each method on MATH-hard-4doms (* 400 test subset): Data sampling details are given in the preceding section.
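The Venn-diagram quantity in the Figure 2 caption, the share of problems solved by methods B or C but never by method A, can be computed with plain set operations. This is a minimal sketch under the assumption that each method's search space is represented as the set of problem IDs it solved in at least one run; the function name and the example sets are hypothetical.

```python
def missed_fraction(solved_a, solved_b, solved_c, n_problems):
    """Share of all problems that B or C solved but A did not."""
    missed = (solved_b | solved_c) - solved_a  # set union, then set difference
    return len(missed) / n_problems


# Hypothetical solved-problem sets over a 10-problem benchmark.
solved_a = {0, 1, 2}
solved_b = {1, 3}
solved_c = {2, 4}

fraction = missed_fraction(solved_a, solved_b, solved_c, n_problems=10)
print(fraction)  # problems 3 and 4 are missed by A
```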