Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks

Dharunish Yugeswardeenoo; Kevin Zhu; Sean O'Brien

TL;DR

Question-Analysis Prompting (QAP) introduces a zero-shot prompting strategy that requires an LLM to explain the question in at least $n$ words before solving, enabling explicit interpretation of the task. Across GPT-3.5 Turbo and GPT-4 Turbo, QAP variants outperform state-of-the-art prompts on AQuA and SAT and achieve top-2 performance in about 75% of tests, with longer explanations helping harder problems but potentially hindering easier ones. The approach emphasizes the model's understanding of the question and demonstrates the importance of response length in reasoning tasks, highlighting a trade-off between depth of analysis and simplicity of the final answer. The work suggests combining QAP with other prompting and decoding strategies and extending to multi-modal tasks, while acknowledging limitations in prompt sensitivity and dataset/model scope.

Abstract

Although LLMs have the potential to transform many fields, they still underperform humans in reasoning tasks. Existing methods induce the model to produce step-by-step calculations, but this research explores the question: Does making the LLM analyze the question improve its performance? We propose a novel prompting strategy called Question Analysis Prompting (QAP), in which the model is prompted to explain the question in $n$ words before solving. The value of $n$ influences the length of the response generated by the model. QAP is evaluated with GPT-3.5 Turbo and GPT-4 Turbo on the arithmetic datasets GSM8K, AQuA, and SAT and the commonsense dataset StrategyQA. QAP is compared with other state-of-the-art prompts, including Chain-of-Thought (CoT), Plan and Solve Prompting (PS+), and Take A Deep Breath (TADB). QAP outperforms all state-of-the-art prompts on the AQuA and SAT datasets on both GPT-3.5 Turbo and GPT-4 Turbo, and it ranks among the top-2 prompts on 75% of the tests. A key factor in QAP's performance is response length: detailed responses are beneficial when answering harder questions but can negatively affect easy questions.
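As a rough illustration of the strategy described above, the following Python sketch assembles a zero-shot QAP prompt for a given word budget $n$. The function name `build_qap_prompt` and the exact instruction wording are assumptions made for illustration and are not confirmed by this page; only the idea of asking for an explanation of at least $n$ words before solving comes from the abstract.

```python
def build_qap_prompt(question: str, n: int) -> str:
    """Append a QAP-style instruction to a question.

    The instruction text is an assumed wording, not a verified
    quotation of the prompt used in the paper.
    """
    if n <= 0:
        raise ValueError("n must be a positive word count")
    instruction = (
        f"Explain this problem to me in at least {n} words. "
        "Then solve for the answer."
    )
    # Zero-shot: no worked examples, just the question and the instruction.
    return f"{question.strip()}\n{instruction}"


# Variants are named by their word budget, e.g. QAP50 or QAP200.
qap_variants = {f"QAP{n}": n for n in (50, 150, 200)}

if __name__ == "__main__":
    print(build_qap_prompt("A train travels 60 km in 1.5 hours. What is its speed?", 50))
```

Larger values of `n` ask for a longer analysis of the question, which, per the abstract, tends to help on harder questions and may hurt on easier ones.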

Paper Structure (22 sections, 8 figures, 6 tables)

This paper contains 22 sections, 8 figures, 6 tables.

Introduction
Prompt Design
Prompt Impact
Experimental Setup
Benchmarks
Models
Prompts
Results
Analysis
Additional Studies
Related Work
Conclusion
Limitations
Ethics
Appendix
...and 7 more sections

Figures (8)

Figure 1: Example of QAP prompting. The prompt triggers an explanation of the question, followed by an approach to solving the problem and detailed steps, finally leading to the correct answer.
Figure 2: We define the difficulty of a problem by the baseline's result: a question the baseline answers incorrectly is "hard", and one it answers correctly is "easy". The left chart shows accuracy within each difficulty level. The right chart shows the mean word count within each difficulty level. All results for each prompt are shown in Table \ref{tab:count} and Table \ref{tab:score}.
Figure 3: Examples of QAP inducing explanations of the question on GSM8K, AQuA, and StrategyQA. The prompts are QAP50, QAP150, and QAP50, respectively. Pink highlights mark key phrases (math reasoning) and orange highlights mark useful background information (commonsense reasoning).
Figure 4: This comparison shows how responses vary as n changes. Only the answer portion is shown. This was experimented on QAP50 and QAP20 on GSM8K on AQuA. Blue marks a QAP200 section that provides more detail than QAP100's response (red) on the same step. Green marks a section that appears in QAP200's response but not in QAP100's at all.
Figure 5: Example in which over-explanation negatively impacts a response. QAP50 arrives at the correct answer (34), but QAP200 does not. QAP200 does reach the correct answer at first, but its additional explanation then leads to a wrong final answer.
...and 3 more figures
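
The difficulty split described in the Figure 2 caption can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the record fields `baseline_correct`, `prompt_correct`, and `response` are hypothetical names, and word count is approximated by whitespace splitting, which may differ from how the paper counts words.

```python
from statistics import mean


def summarize_by_difficulty(records):
    """Group questions by baseline outcome, then report accuracy and
    mean response word count for the prompt under study.

    Each record is a dict with hypothetical keys:
      baseline_correct (bool), prompt_correct (bool), response (str).
    """
    buckets = {"easy": [], "hard": []}
    for r in records:
        # "easy" = baseline answered correctly, "hard" = baseline was wrong.
        label = "easy" if r["baseline_correct"] else "hard"
        buckets[label].append(r)

    summary = {}
    for label, rows in buckets.items():
        if not rows:
            summary[label] = None  # no questions fell in this bucket
            continue
        summary[label] = {
            "accuracy": mean(1.0 if r["prompt_correct"] else 0.0 for r in rows),
            "mean_words": mean(len(r["response"].split()) for r in rows),
        }
    return summary
```

Running this once per prompt (for example, once per QAP variant) would give the per-difficulty accuracy and word-count values that the two charts compare.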
