Self-Polish: Enhance Reasoning in Large Language Models via Problem Refinement

Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

The paper addresses the bottleneck of multi-step reasoning in large language models by shifting focus from the reasoning side to the problem side. It introduces Self-Polish (SP), a problem-refinement prompting framework with zero-shot, in-context, automatic, and complexity-based variants, plus a progressive refinement procedure that iterates until convergence. Across five benchmarks and multiple models, SP yields consistent improvements and enhances robustness, and it is shown to be complementary (orthogonal) to existing reasoning-side prompting methods. The work demonstrates that refining the input problem can substantially ease subsequent reasoning, offering a practical, training-free augmentation to current prompting strategies and a foundation for future work on problem-centric reasoning enhancements.

Abstract

To enhance the multi-step reasoning capabilities of large language models, researchers have extensively explored prompting methods, notably the Chain-of-Thought (CoT) method, which explicitly elicits human-like rationales. However, they have inadvertently overlooked the potential of enhancing model reasoning performance by formulating higher-quality problems. In this work, we start from the problem side and propose Self-Polish (SP), a novel method that facilitates the model's reasoning by guiding it to progressively refine the given problems to be more comprehensible and solvable. We also explore several automatic prompting variants and propose the Self-Polish prompt bank for the community. SP is orthogonal to all other prompting methods on the answer/reasoning side, such as CoT, allowing for seamless integration with state-of-the-art techniques for further improvement. Thorough experiments show that the proposed method attains notable and consistent effectiveness on five reasoning benchmarks across different models. Furthermore, our method also showcases impressive performance on robustness evaluation. Code and prompts are available at https://github.com/WooooDyy/Self-Polish.
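
The progressive refinement described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `refine` and `answer` callables are hypothetical stand-ins for model calls, the stopping rule (two consecutive answers agree) is an assumed reading of "convergence", and returning the last answer when no convergence occurs is only one possible selection strategy.

```python
from typing import Callable, Tuple


def self_polish(
    problem: str,
    refine: Callable[[str], str],   # hypothetical: asks the model to rewrite the problem
    answer: Callable[[str], str],   # hypothetical: any reasoning-side method, e.g. CoT
    max_iters: int = 5,             # assumed cap on refinement rounds
) -> Tuple[str, bool, int]:
    """Refine `problem` until two consecutive answers agree or the cap is hit.

    Returns (final_answer, converged, refinement_rounds_used).
    """
    prev_answer = answer(problem)          # answer to the original problem
    for step in range(1, max_iters + 1):
        problem = refine(problem)          # problem-side rewrite
        curr_answer = answer(problem)      # re-solve the refined problem
        if curr_answer == prev_answer:     # assumed convergence criterion
            return curr_answer, True, step
        prev_answer = curr_answer
    # No convergence: fall back to the most recent answer (an assumption).
    return prev_answer, False, max_iters


# Toy stand-ins for the model calls, for illustration only.
DISTRACTOR = " Ada bought 2000 tomatoes from the grocery store."


def toy_refine(p: str) -> str:
    return p.replace(DISTRACTOR, "")       # drop the irrelevant sentence


def toy_answer(p: str) -> str:
    return "distracted" if "tomatoes" in p else "42"


toy_problem = "A necklace needs 42 beads." + DISTRACTOR + " How many beads?"
result = self_polish(toy_problem, toy_refine, toy_answer)
```

In this toy run the first refinement removes the distractor, the answer changes, and the second round confirms it, so the loop stops after two rounds. Because `answer` is an arbitrary callable, any reasoning-side prompting method can be plugged in unchanged, which mirrors the orthogonality claim.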

Paper Structure (46 sections, 9 figures, 5 tables, 1 algorithm)

Figures (9)

  • Figure 1: Schematic comparison between Self-Polish and other representative approaches for reasoning with prompting. Previous paradigms enhance the reasoning capability of LLMs from the answer/reasoning side, while our method starts from the problem side and refines problems to be simpler and more comprehensible for models.
  • Figure 2: An example illustrating the framework and problem-refining patterns of Self-Polish. In the first refining iteration, the irrelevant information "Ada bought 2000 tomatoes from the grocery store." is removed. In the second iteration, the conditions are reordered for easier calculation of the number of beads required for each type of beaded product. In the third iteration, the local conditions are combined in parallel to form new conditions (the total number of beads required for necklaces and bracelets).
  • Figure 3: Evaluating Self-Polish on various benchmarks with different models. Self-Polish consistently improves reasoning performance across multiple models and benchmarks.
  • Figure 4: Evaluation results on GSM-IC (arXiv:2302.00093). Self-Polish (SP) enhances the robustness and reliability of various models when combined with different prompting techniques.
  • Figure 5: Ablation studies and the distribution of actual iteration counts. (a) and (c) show the performance (left vertical axis) under different final-answer selection strategies and different maximum iteration counts $T$. "Converge" denotes the performance computed as $N_{conv}/N_{all}$, where $N_{conv}$ is the number of examples answered correctly with converged answers and $N_{all}$ is the number of all test examples. A line shows the average actual number of iterations at each value of $T$ (right vertical axis). (b) and (d) show the distribution of actual iteration counts when $T=5$.
  • ...and 4 more figures
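
The "Converge" metric and the average actual iteration count from the Figure 5 caption can be computed as in the sketch below. The record layout `(converged, correct, iterations)` and the function name are assumptions made for illustration; only the ratio $N_{conv}/N_{all}$ comes from the caption.

```python
from typing import Iterable, Tuple

# Hypothetical per-example record: (converged, correct, iterations_used)
Record = Tuple[bool, bool, int]


def converge_stats(records: Iterable[Record]) -> Tuple[float, float]:
    """Return (N_conv / N_all, average actual iterations).

    N_conv counts examples that are answered correctly AND whose answer
    converged; N_all counts every test example.
    """
    records = list(records)
    n_all = len(records)
    if n_all == 0:
        raise ValueError("no records given")
    n_conv = sum(1 for converged, correct, _ in records if converged and correct)
    avg_iters = sum(iters for _, _, iters in records) / n_all
    return n_conv / n_all, avg_iters


# Made-up records, for illustration only (not results from the paper).
example = [
    (True, True, 2),    # converged and correct: counts toward N_conv
    (True, False, 1),   # converged on a wrong answer
    (False, True, 5),   # correct but hit the cap without converging
    (True, True, 3),    # converged and correct
]
converge_acc, avg_iterations = converge_stats(example)
```

With these made-up records, two of four examples are both converged and correct, so the ratio is 0.5, and the average iteration count is 11 / 4 = 2.75.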